Methods for rapid identification and quantitation of nucleic acid variants

ABSTRACT

There is a need for nucleic acid analysis which is both specific and rapid, and in which no nucleic acid sequencing is required. The present invention addresses this need, among others, by providing a method of nucleic acid amplification of overlapping sub-segments of a nucleic acid followed by molecular mass measurement of resulting amplification products by mass spectrometry, and determination of the base compositions of the amplification products.

RELATED APPLICATIONS

The present application is a continuation of U.S. application Ser. No. 11/491,376, filed Jul. 21, 2006, which claims the benefit of priority to U.S. Provisional Application Ser. No. 60/701,404, filed Jul. 21, 2005; to U.S. Provisional Application Ser. No. 60/771,101, filed Feb. 6, 2006; and to U.S. Provisional Application Ser. No. 60/747,607, filed May 18, 2006. Each of the above listed applications is incorporated herein by reference in its entirety. Methods disclosed in U.S. application Ser. Nos. 10/156,608, 09/891,793, 10/418,514, 10/660,997, 10/660,122, 10/660,996, 10/660,998, 10/728,486, 10/405,756, 10/853,660, 11/060,135, 11/073,362 and 11/209,439 are commonly owned and incorporated herein by reference in their entirety for any purpose.

SEQUENCE LISTING

A paper copy of the sequence listing and a computer-readable form of the sequence listing, on diskette, containing the file named DIBIS0075US.C1SEQ.txt, which was created on Nov. 11, 2009, are herein incorporated by reference.

FIELD OF THE INVENTION

The present invention relates generally to the field of nucleic acid analysis and provides methods, compositions and kits useful for this purpose when combined with mass spectrometry.

BACKGROUND OF THE INVENTION

Characterization of nucleic acid variants is a problem of great importance in various fields of molecular biology such as, for example, genotyping and identification of strains of bacteria and viruses which are subject to evolutionary pressures via mechanisms including mutation, natural selection, genetic drift and recombination. Nucleic acid heterogeneity is a common feature of RNA viruses, for example. Populations of RNA viruses often exhibit high levels of heterogeneity due to mutations which enhance the ability of the viruses to adapt to growth conditions. Mixed populations of RNA virus quasispecies are known to exist in viral vaccines. It would be advantageous to have a method for monitoring the heterogeneity of viral vaccines. Likewise, new strains of bacterial species are also known to evolve rapidly.

Characterization and quantitation of newly-evolving bacteria and viruses such as the SARS coronavirus, for example, is typically the first step in containment of an epidemic or infectious disease outbreak. In addition to characterization of naturally occurring variants of bacteria and viruses, there is a need for characterization of genetically engineered bacterial or viral bio-weapons in forensic or bio-warfare investigations. Unfortunately, the process of sequencing entire bacterial or viral genomes or vaccine vector sequences is time consuming and is not effective at resolving mixtures of nucleic acid variants.

Mitochondrial DNA is found in eukaryotes and differs from nuclear DNA in its location, its sequence, its quantity in the cell, and its mode of inheritance. The nucleus of the human cell contains two sets of 23 chromosomes, one paternal set and one maternal set. However, cells may contain hundreds to thousands of mitochondria, each of which may contain several copies of mitochondrial DNA. Nuclear DNA has many more bases than mitochondrial DNA, but mitochondrial DNA is present in many more copies than nuclear DNA. This characteristic of mitochondrial DNA is useful in situations where the amount of DNA in a sample is very limited. Typical sources of DNA recovered from crime scenes include hair, bones, teeth, and body fluids such as saliva, semen, and blood.

In humans, mitochondrial DNA is inherited strictly from the mother (Case J. T. and Wallace, D. C., Somatic Cell Genetics, 1981, 7, 103-108; Giles, R. E. et al. Proc. Natl. Acad. Sci. 1980, 77, 6715-6719; Hutchison, C. A. et al. Nature, 1974, 251, 536-538). Thus, the mitochondrial DNA sequences obtained from maternally related individuals, such as a brother and a sister or a mother and a daughter, will exactly match each other in the absence of a mutation. This characteristic of mitochondrial DNA is advantageous in missing persons cases as reference mitochondrial DNA samples can be supplied by any maternal relative of the missing individual (Ginther, C. et al. Nature Genetics, 1992, 2, 135-138; Holland, M. M. et al. Journal of Forensic Sciences, 1993, 38, 542-553; Stoneking, M. et al. American Journal of Human Genetics, 1991, 48, 370-382).

The human mitochondrial DNA genome is approximately 16,569 bases in length and has two general regions: the coding region and the control region. The coding region is responsible for the production of various biological molecules involved in the process of energy production in the cell and includes about 37 genes (22 transfer RNAs, 2 ribosomal RNAs, and 13 peptides), with very little intergenic sequence and no introns. The control region is responsible for regulation of the mitochondrial DNA molecule. Two regions of mitochondrial DNA within the control region have been found to be highly polymorphic, or variable, within the human population (Greenberg, B. D. et al. Gene, 1983, 21, 33-49). These two regions are termed “hypervariable Region I” (HV1), which has an approximate length of 342 base pairs (bp), and “hypervariable Region II” (HV2), which has an approximate length of 268 bp. Forensic mitochondrial DNA examinations are performed using these two hypervariable regions because of the high degree of variability found among individuals.

There exists a need for rapid identification of humans wherein human remains and/or biological samples are analyzed. Such remains or samples may be associated with war-related casualties, aircraft crashes, and acts of terrorism, for example. Analysis of mitochondrial DNA enables a rule-in/rule-out identification process for persons for whom DNA profiles from a maternal relative are available. Human identification by analysis of mitochondrial DNA can also be applied to human remains and/or biological samples obtained from crime scenes.

The process of human identification is a common objective of forensics investigations. As used herein, “forensics” is the study of evidence discovered at a crime or accident scene and used in a court of law. “Forensic science” is any science used for the purposes of the law, in particular the criminal justice system, and therefore provides impartial scientific evidence for use in the courts of law, and in a criminal investigation and trial. Forensic science is a multidisciplinary subject, drawing principally from chemistry and biology, but also from physics, geology, psychology and social science, for example.

Forensic scientists generally use the two hypervariable regions of human mitochondrial DNA for analysis. These hypervariable regions, or portions thereof, provide only one non-limiting example of a region of mitochondrial DNA useful for identification analysis.

A typical mitochondrial DNA analysis begins when total genomic and mitochondrial DNA is extracted from biological material, such as a tooth, blood sample, or hair. The polymerase chain reaction (PCR) is then used to amplify, or create many copies of, the two hypervariable portions of the non-coding region of the mitochondrial DNA molecule, using flanking primers. When adequate amounts of PCR product are amplified to provide all the necessary information about the two hypervariable regions, sequencing reactions are performed. Where possible, the sequences of both hypervariable regions are determined on both strands of the double-stranded DNA molecule, with sufficient redundancy to confirm the nucleotide substitutions that characterize that particular sample. The entire process is then repeated with a known sample, such as blood or saliva collected from a known individual. The sequences from both samples are compared to determine if they match. Finally, in the event of an inclusion or match, The Scientific Working Group on DNA Analysis Methods (SWGDAM) mitochondrial DNA database, which is maintained by the FBI, is searched for the mitochondrial sequence that has been observed for the samples. The analysts can then report the number of observations of this type based on the nucleotide positions that have been read. A written report can be provided to the submitting agency. This process is described in more detail in M. M. Holland and T. J. Parsons 1999, Forensic Science Review, volume 11, pages 25-51.

Approximately 610 bp of mitochondrial DNA are currently sequenced in forensic mitochondrial DNA analysis. Recording and comparing mitochondrial DNA sequences would be difficult and potentially confusing if all of the bases were listed. Thus, mitochondrial DNA sequence information is recorded by listing only the differences with respect to a reference DNA sequence. By convention, human mitochondrial DNA sequences are described using the first complete published mitochondrial DNA sequence as a reference (Anderson, S. et al., Nature, 1981, 290, 457-465). This sequence is commonly referred to as the Anderson sequence. It is also called the Cambridge reference sequence or the Oxford sequence. Each base pair in this sequence is assigned a number. Deviations from this reference sequence are recorded as the number of the position demonstrating a difference and a letter designation of the different base. For example, a transition from A to G at position 263 would be recorded as 263 G. If deletions or insertions of bases are present in the mitochondrial DNA, these differences are denoted as well.
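The recording convention described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `record_differences`, the toy five-base sequences, and the starting coordinate are assumptions introduced here and are not taken from the reference sequence. The sketch assumes the sample has already been aligned to the reference at equal length, so insertions and deletions are not handled.

```python
def record_differences(reference: str, sample: str, start: int = 1) -> list[str]:
    """List substitutions as '<position> <base>' relative to a reference.

    Assumes both sequences are pre-aligned and of equal length, so
    insertions and deletions are outside the scope of this sketch.
    `start` is the reference coordinate of the first base supplied.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{start + offset} {sample_base}"
        for offset, (ref_base, sample_base) in enumerate(
            zip(reference.upper(), sample.upper())
        )
        if ref_base != sample_base
    ]


# Toy window whose first base is numbered 261 (made-up bases, not the
# actual reference sequence); the sample carries G in place of A at 263.
toy_reference = "CAATT"
toy_sample = "CAGTT"
print(record_differences(toy_reference, toy_sample, start=261))
```

Only the positions that differ are reported, which mirrors the practice of listing deviations rather than the full sequence.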

In the United States, there are seven laboratories currently conducting forensic mitochondrial DNA examinations: the FBI Laboratory; Laboratory Corporation of America (LabCorp) in Research Triangle Park, N.C.; Mitotyping Technologies in State College, Pa.; the Bode Technology Group (BTG) in Springfield, Va.; the Armed Forces DNA Identification Laboratory (AFDIL) in Rockville, Md.; BioSynthesis, Inc. in Lewisville, Tex.; and Reliagene in New Orleans, Louisiana.

Mitochondrial DNA analyses have been admitted in criminal proceedings from these laboratories in the following states as of April 1999: Alabama, Arkansas, Florida, Indiana, Illinois, Maryland, Michigan, New Mexico, North Carolina, Pennsylvania, South Carolina, Tennessee, Texas, and Washington. Mitochondrial DNA has also been admitted and used in criminal trials in Australia, the United Kingdom, and several other European countries.

Since 1996, the number of individuals performing mitochondrial DNA analysis at the FBI Laboratory has grown from 4 to 12, with more personnel expected in the near future. Over 150 mitochondrial DNA cases have been completed by the FBI Laboratory as of March 1999, and dozens more await analysis. Forensic courses are being taught by the FBI Laboratory personnel and other groups to educate forensic scientists in the procedures and interpretation of mitochondrial DNA sequencing. More and more individuals are learning about the value of mitochondrial DNA sequencing for obtaining useful information from evidentiary samples that are small, degraded, or both. Mitochondrial DNA sequencing is becoming known not only as an exclusionary tool but also as a complementary technique for use with other human identification procedures. Mitochondrial DNA analysis will continue to be a powerful tool for law enforcement officials in the years to come as other applications are developed, validated, and applied to forensic evidence.

Presently, the forensic analysis of mitochondrial DNA is rigorous and labor-intensive. Currently, only 1-2 cases per month per analyst can be performed. Several molecular biological techniques are combined to obtain a mitochondrial DNA sequence from a sample. The steps of the mitochondrial DNA analysis process include primary visual analysis, sample preparation, DNA extraction, polymerase chain reaction (PCR) amplification, post-amplification quantification of the DNA, automated DNA sequencing, and data analysis. Another complicating factor in the forensic analysis of mitochondrial DNA is the occurrence of heteroplasmy wherein the pool of mitochondrial DNAs in a given cell is heterogeneous due to mutations in individual mitochondrial DNAs. There are different forms of heteroplasmy found in mitochondrial DNA. For example, sequence heteroplasmy (also known as point heteroplasmy) is the occurrence of more than one base at a particular position or positions in the mitochondrial DNA sequence. Length heteroplasmy is the occurrence of more than one length of a stretch of the same base in a mitochondrial DNA sequence as a result of insertion of nucleotide residues.

Heteroplasmy is a problem for forensic investigators since a sample from a crime scene can differ from a sample from a suspect by one base pair and this difference may be interpreted as sufficient evidence to eliminate that individual as the suspect. Hair samples from a single individual can contain heteroplasmic mutations at vastly different concentrations and even the root and shaft of a single hair can differ. The detection methods currently available to molecular biologists cannot detect low levels of heteroplasmy. Furthermore, if present, length heteroplasmy will adversely affect sequencing runs by resulting in an out-of-frame sequence that cannot be interpreted.

Mass spectrometry provides detailed information about the molecules being analyzed, including high mass accuracy. It is also a process that can be easily automated.

There is a need for a mitochondrial DNA forensic analysis which is both specific and rapid, and in which no nucleic acid sequencing is required. There is also a need for a method of rapid characterization and quantitation of nucleic acids which have variant positions relative to a reference sequence. These needs, as well as others, are addressed herein below.

SUMMARY OF THE INVENTION

Described herein are compositions and methods for analyzing a nucleic acid by performing the steps of obtaining a sample of nucleic acid for base composition analysis; selecting at least two primer pairs that will generate overlapping amplification products of at least two sub-segments of the nucleic acid; amplifying at least two nucleic acid sequences of a region of the nucleic acid designated as a target for base composition analysis using the primer pairs, thereby generating at least two overlapping amplification products; obtaining base compositions of the amplification products by measuring molecular masses of one or more of the amplification products using a mass spectrometer; and converting one or more of the measured molecular masses to base compositions; comparing one or more of the base compositions with one or more base compositions of reference sub-segments of a reference sequence; and identifying the presence of a particular nucleic acid sequence or variant thereof.
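One way to picture the conversion of a measured molecular mass to a base composition is a brute-force search over base counts, sketched below in Python. This is not presented as the conversion procedure of the disclosure, which is not specified in this passage. The residue masses are approximate average values, the end-group correction assumes a strand with 5'-hydroxyl and 3'-hydroxyl termini, and the names `strand_mass` and `candidate_compositions` are hypothetical.

```python
# Approximate average masses (Da) of deoxynucleotide residues within a DNA
# strand. These values and the end-group correction are assumptions made
# for illustration only.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
END_GROUP_MASS = -61.96  # assumed: 5'-hydroxyl and 3'-hydroxyl termini


def strand_mass(composition, end_group=END_GROUP_MASS):
    """Approximate average mass of one strand with the given base counts."""
    return sum(RESIDUE_MASS[b] * n for b, n in composition.items()) + end_group


def candidate_compositions(measured_mass, min_length, max_length, tolerance):
    """Enumerate base compositions whose mass lies within `tolerance` Da."""
    hits = []
    for length in range(min_length, max_length + 1):
        for a in range(length + 1):
            for c in range(length + 1 - a):
                for g in range(length + 1 - a - c):
                    composition = {"A": a, "C": c, "G": g,
                                   "T": length - a - c - g}
                    if abs(strand_mass(composition) - measured_mass) <= tolerance:
                        hits.append(composition)
    return hits


# Simulate a measurement from a known (made-up) composition, then recover it.
true_composition = {"A": 10, "C": 8, "G": 7, "T": 5}
measured = strand_mass(true_composition)
for hit in candidate_compositions(measured, 28, 32, tolerance=0.05):
    print(hit)
```

More than one composition can fall inside the tolerance window, so the number of candidates depends on the mass accuracy available and on how tightly the strand length is bounded.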

The nucleic acid analyzed is obtained from a human, bacterium, virus, fungus, synthetic nucleic acid source, recombinant nucleic acid source, or encodes a biological product such as a vaccine, antibody or other biological product.

Further described herein are compositions and methods for identifying a human by obtaining a sample comprising mitochondrial DNA of the human for base composition analysis; selecting at least two primer pairs that will generate overlapping amplification products representing overlapping sub-segments of the mitochondrial DNA; amplifying at least two nucleic acid sequences of a region of the mitochondrial DNA designated as a target for base composition analysis using the at least two primer pairs, thereby generating at least two overlapping amplification products; obtaining base compositions of the amplification products by measuring molecular masses of one or more of the amplification products generated using a mass spectrometer and converting one or more of the measured molecular masses to base compositions; and comparing one or more of the base compositions with one or more base compositions of reference sub-segments of a reference sequence thereby identifying the human.

Also described herein are compositions and methods for characterizing heteroplasmy of mitochondrial DNA comprising the steps of obtaining a sample comprising mitochondrial DNA for base composition analysis; selecting at least two primer pairs that will generate overlapping amplification products representing sub-segments of the mitochondrial DNA; amplifying at least two nucleic acid sequences of a region of the mitochondrial DNA designated as a target for base composition analysis using the at least two primer pairs, thereby generating at least two overlapping amplification products; obtaining base compositions of the amplification products by measuring molecular masses of one or more of the amplification products using a mass spectrometer; and converting one or more of the measured molecular masses to base compositions; comparing one or more of the base compositions with one or more base compositions of reference sub-segments of a reference sequence; and identifying at least two distinct amplification products with distinct base compositions obtained by the same pair of primers, thereby characterizing the heteroplasmy.
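The final step above, identifying distinct base compositions obtained with the same primer pair, can be sketched as a simple grouping operation. The following Python sketch is illustrative only: the function name `find_heteroplasmy` is hypothetical, base compositions are written as (A, C, G, T) count tuples, and the counts shown are made-up values rather than measured data.

```python
from collections import defaultdict


def find_heteroplasmy(observations):
    """Group observed base compositions by primer pair.

    `observations` is an iterable of (primer_pair_number, composition)
    tuples, where each composition is an (A, C, G, T) count tuple.
    Returns only the primer pairs that yielded two or more distinct
    base compositions.
    """
    by_pair = defaultdict(set)
    for primer_pair, composition in observations:
        by_pair[primer_pair].add(composition)
    return {
        primer_pair: sorted(compositions)
        for primer_pair, compositions in by_pair.items()
        if len(compositions) >= 2
    }


# Made-up compositions for three primer pairs; pair 2896 shows three variants.
observations = [
    (2904, (30, 25, 28, 22)),
    (2896, (31, 24, 27, 23)),
    (2896, (30, 25, 27, 23)),
    (2896, (31, 24, 26, 24)),
    (2913, (29, 26, 30, 20)),
]
print(find_heteroplasmy(observations))
```

A primer pair that appears in the result has produced at least two amplification products with different base compositions, which is the signature of heteroplasmy described above.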

Also disclosed are primer pair compositions and kits comprising the same which are useful for obtaining amplification products used in genotyping organisms.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the definition of sub-segments of a reference sequence for amplification. Arrows indicate the position of primer hybridization for obtaining an amplification product corresponding to a sub-segment. For example, FWD-A indicates the hybridization position of the forward primer for obtaining an amplification product corresponding to sub-segment A, while REV-A indicates the hybridization position of the reverse primer for obtaining an amplification product corresponding to sub-segment A. The overlap of sub-segment A, which has a length of 120 nucleobases (bp), with sub-segment B is shown on the left side.

FIG. 2 is a mass spectrum of three amplification products of a sample of mitochondrial DNA displaying six peaks corresponding to the individual strands of each of the three amplification products, each corresponding to sub-segments of the target mitochondrial DNA. Peaks labeled A and B are from a single amplification product of the HV1 region obtained with primer pair number 2892 (SEQ ID NOs: 4:29). Peaks labeled C and D are from a single amplification product of the HV1 region obtained with primer pair number 2901 (SEQ ID NOs: 12:37). Peaks labeled E and F are from a single amplification product of the HV2 region obtained with primer pair number 2906 (SEQ ID NOs: 17:42).

FIG. 3 represents a refinement of peaks from a mass spectrum of a sample mitochondrial DNA displaying six peak lines corresponding to the individual strands of each of the three amplification products. Detection of heteroplasmy in one of the amplified regions is indicated. Peaks labeled A and B are from a single amplification product of the HV1 region obtained with primer pair number 2904 (SEQ ID NOs: 15:40). Peaks labeled C and D are from a single amplification product of the HV1 region obtained with primer pair number 2896 (SEQ ID NOs: 8:33). Peaks labeled C′ and D′ are from a single amplification product of the HV1 region obtained with primer pair number 2896 which represents one heteroplasmic variant of the amplification product represented by peaks C and D. Peaks labeled C″ and D″ are from a single amplification product of the HV1 region obtained with primer pair number 2896 which represents another heteroplasmic variant of the amplification product represented by peaks C and D. Peaks labeled E and F are from a single amplification product of the HV2 region obtained with primer pair number 2913 (SEQ ID NOs: 22:47).

FIG. 4 is an illustration of the names and chromosome locations for the CODIS 13 markers, as well as for the AMEL markers on the X and Y chromosomes. The CODIS 13 short tandem repeats are commonly used by law enforcement for determining the source identity for a given nucleic acid.

DEFINITIONS

A number of terms and phrases are defined below:

As described herein, nucleic acids are analyzed to generate a base composition profile. Nucleic acids include, but are not limited to, human mitochondrial DNA, human chromosomal DNA, bacterial genomic DNA, fungal DNA, viral DNA, viral RNA, and commercially available plasmids, vectors or vaccines. The nucleic acids are referred to as having regions, which are defined as portions of the nucleic acid that are known or suspected to comprise genetic sequence differences that allow for the characterization of the nucleic acid. By use of the term “characterization” it is meant that the source of the nucleic acid can be identified (e.g., genetic identification of a human, identification of a recombination event in a plasmid, diagnosis of a human genetic disposition towards a disease or trait, HN typing of influenza virus strains). Part or all of a region may form the target for analysis using the disclosed material and methods. Alternatively, an entire nucleic acid can be analyzed, which is typically more useful when there are no defined regions for characterization. Thus, the whole nucleic acid will be referred to herein as a region and a target. Within a target there are sub-segments. Sub-segments are the portions of nucleic acid that are flanked by primers to generate individual amplified products or amplicons. These sub-segments preferably overlap.

As used herein, “Mitochondrial DNA” refers to a circular ring of DNA which is separate from chromosomal DNA and contained as multiple copies within mitochondria. Mitochondrial DNA is often abbreviated as “mtDNA” and will be recognized as such by one with ordinary skill in the arts of mitochondrial DNA analysis. In a preferred embodiment, the objective is to identify a human. Nucleic acid is obtained from a human cell, such as a blood cell, hair cell, skin cell or any other human cell appropriate for obtaining nucleic acid. In some embodiments, the nucleic acid is mitochondrial DNA. In some embodiments, certain portions of mitochondrial DNA are appropriate for base composition analysis such as, for example, HV1 and HV2.

As used herein, the term “HV1” refers to a region within mitochondrial DNA known as “hypervariable region 1.” With respect to the reference Anderson/Cambridge mitochondrial DNA sequence, the HV1 region is represented by coordinates 15924 . . . 16428. This region is useful for identification of humans because it has a high degree of variability among different human individuals. In some embodiments, a defined portion of the HV1 region is analyzed by base composition analysis of “sub-segments” of the defined portion. In this embodiment, the defined portion of HV1 represents the “target.” In preferred embodiments, the entire HV1 region (coordinates 15924 . . . 16428) is divided into overlapping sub-segments. In this embodiment, the entire HV1 region represents the “target.”

As used herein, the term “HV2” refers to a region within mitochondrial DNA known as “hypervariable region 2.” With respect to the reference Anderson/Cambridge mitochondrial DNA sequence, the HV2 region is represented by coordinates 31 . . . 576. As for HV1, the HV2 region is useful for identification of humans because it also has a high degree of variability among different human individuals. In some embodiments, a defined portion of the HV2 region is analyzed by base composition analysis of “sub-segments” of the defined portion. In this embodiment, the defined portion of HV2 represents the “target.” In preferred embodiments, the entire HV2 region (coordinates 31 . . . 576) is divided into overlapping sub-segments. In this embodiment, the entire HV2 region represents the “target.”

In other embodiments, additional target regions within the mitochondrial DNA may be chosen for base composition analysis.

As used herein, the term “target” generally refers to a nucleic acid sequence to be detected or characterized. Thus, the “target” is sought to be sorted out from other nucleic acid sequences.

As used herein, “sub-segments” are portions of a given target which are of useful size for base composition analysis. In some embodiments, the sizes of sub-segments range from about 45 to about 150 nucleobases in length. In preferred embodiments, the “sub-segments” overlap with each other and cover the entire target as shown in FIG. 1. Amplification products representing the sub-segments are obtained by amplification methods, such as PCR, that are well known to those with ordinary skill in molecular biology techniques. The amplification products representing the sub-segments are analyzed by mass spectrometry to determine their molecular masses, and base compositions of the amplification products are calculated from the molecular masses. The experimentally-determined base compositions are then compared with base compositions of “reference sub-segments” of a “reference nucleic acid” whose sequence and/or base composition is known. In preferred embodiments a database containing base compositions of reference nucleic acids and sub-segments thereof is used for comparison with the experimentally-determined base compositions. A match of one or more experimentally-determined base compositions of one or more sub-segments with one or more base compositions of reference sub-segments will provide the identity of the human.
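The division of a target into overlapping sub-segments that together cover the entire target can be sketched as follows. This Python sketch is illustrative only: the function name `tile_subsegments` is hypothetical, and the 20-nucleobase overlap is an assumed value chosen for the example. Primer hybridization sites are not modeled, and the sketch does not enforce a minimum length on the final sub-segment.

```python
def tile_subsegments(start, end, length=120, overlap=20):
    """Return inclusive (start, end) coordinates of overlapping sub-segments.

    Consecutive sub-segments share `overlap` positions, and the last one is
    truncated at `end`, so it may be shorter than `length`.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and less than length")
    step = length - overlap
    segments = []
    position = start
    while True:
        segment_end = min(position + length - 1, end)
        segments.append((position, segment_end))
        if segment_end == end:
            break
        position += step
    return segments


# Example: tile a target spanning coordinates 31 to 576 with
# 120-nucleobase sub-segments that overlap by an assumed 20 positions.
for segment in tile_subsegments(31, 576, length=120, overlap=20):
    print(segment)
```

Because each sub-segment begins before the previous one ends, every position of the target falls within at least one sub-segment, and positions in the overlaps fall within two.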

The same definitions of the terms “target,” “sub-segment,” “reference sub-segment” and “reference nucleic acid” are applicable to other preferred embodiments where base composition analysis is used to identify a human by analysis of specific human chromosomal target regions such as CODIS markers for example. FIG. 4 is an illustration of the names and chromosome locations for the CODIS 13 markers, as well as for the AMEL markers on the X and Y chromosomes.

The same definitions of the terms “target,” “sub-segment,” “reference sub-segment” and “reference nucleic acid” are applicable to other preferred embodiments where base composition analysis is used to identify or characterize a genotype of a microorganism such as a bacterium, virus, or fungus for example. Characterization of genotypes of microorganisms is useful in infectious disease diagnostics for example. In these embodiments, a given target may represent the entire genome of a microorganism or a portion thereof. The target is analyzed by characterization of amplification products representing sub-segments of the target.

The same definitions of the terms “target,” “sub-segment,” “reference sub-segment” and “reference nucleic acid” are applicable to other preferred embodiments where base composition analysis is used to validate a “test nucleic acid” with respect to a reference nucleic acid. Validation of test nucleic acids is desirable in quality control of pharmaceutical production such as in production of vectors carrying genes encoding therapeutic proteins such as vaccines for example. In this embodiment, the “test nucleic acid” is expected to be identical in sequence and base composition to the reference nucleic acid. Comparison of experimentally determined base compositions of amplification products representing sub-segments of the target with base compositions of reference sub-segments may either indicate that the base compositions are identical, thereby validating the test nucleic acid, or identify a variant of the reference nucleic acid.
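The comparison described above can be sketched as a per-sub-segment check of experimentally determined base compositions against those computed from a reference sequence. The Python sketch below is illustrative only: the names `base_composition` and `find_variant_subsegments`, the sub-segment labels, and the short toy sequences are all assumptions introduced for the example. Base compositions are written as (A, C, G, T) count tuples.

```python
from collections import Counter


def base_composition(sequence):
    """Return the (A, C, G, T) counts of a sequence."""
    counts = Counter(sequence.upper())
    return tuple(counts.get(base, 0) for base in "ACGT")


def find_variant_subsegments(reference_subsegments, measured_compositions):
    """Return the names of sub-segments whose measured composition differs.

    `reference_subsegments` maps a sub-segment name to its reference
    sequence; `measured_compositions` maps the same names to measured
    (A, C, G, T) tuples. A missing measurement counts as a difference.
    """
    return [
        name
        for name, sequence in reference_subsegments.items()
        if base_composition(sequence) != measured_compositions.get(name)
    ]


# Toy reference sub-segments (made-up sequences) and measured compositions;
# sub-segment "B" is measured with one A replaced by G.
reference_subsegments = {"A": "ACGTACGTTA", "B": "GGCATTACCA"}
measured_compositions = {"A": (3, 2, 2, 3), "B": (2, 3, 3, 2)}
print(find_variant_subsegments(reference_subsegments, measured_compositions))
```

An empty result corresponds to the case in which all base compositions are identical to the reference; a non-empty result names the sub-segments in which a variant is indicated.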

“Amplification” is a special case of nucleic acid replication involving template specificity. It is to be contrasted with non-specific template replication (i.e., replication that is template-dependent but not dependent on a specific template). Template specificity is here distinguished from fidelity of replication (i.e., synthesis of the proper polynucleotide sequence) and nucleotide (ribo- or deoxyribo-) specificity. Template specificity is frequently described in terms of “target” specificity.

Template or target specificity is achieved in most amplification techniques by the choice of enzyme. Amplification enzymes are enzymes that, under the conditions in which they are used, will process only specific sequences of nucleic acid in a heterogeneous mixture of nucleic acid. For example, in the case of Qβ replicase, MDV-1 RNA is the specific template for the replicase (D. L. Kacian et al., Proc. Natl. Acad. Sci. USA 69:3038 [1972]). Other nucleic acid will not be replicated by this amplification enzyme. Similarly, in the case of T7 RNA polymerase, this amplification enzyme has a stringent specificity for its own promoters (Chamberlin et al., Nature 228:227 [1970]). In the case of T4 DNA ligase, the enzyme will not ligate the two oligonucleotides or polynucleotides where there is a mismatch between the oligonucleotide or polynucleotide substrate and the template at the ligation junction (D. Y. Wu and R. B. Wallace, Genomics 4:560 [1989]). Finally, Taq and Pfu polymerases, by virtue of their ability to function at high temperature, are found to display high specificity for the sequences bounded and thus defined by the primers; the high temperature results in thermodynamic conditions that favor primer hybridization with the target sequences and not hybridization with non-target sequences (H. A. Erlich (ed.), PCR Technology, Stockton Press [1989]).

As used herein, the term “sample template” refers to nucleic acid originating from a sample that is analyzed for the presence of “target” (defined above). In contrast, “background template” is used in reference to nucleic acid other than sample template that may or may not be present in a sample. Background template is most often inadvertent. It may be the result of carryover, or it may be due to the presence of nucleic acid contaminants sought to be purified away from the sample. For example, nucleic acids from organisms other than those to be detected may be present as background in a test sample.

As used herein, the term “primer” refers to an oligonucleotide, whether occurring naturally, such as a purified fragment from a restriction digest, or produced synthetically, which is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis of a primer extension product which is complementary to a nucleic acid strand is induced (i.e., in the presence of nucleotides and an inducing agent such as DNA polymerase and at a suitable temperature and pH). The primer is preferably single stranded for maximum efficiency in amplification, but may alternatively be double stranded. If double stranded, the primer is first treated to separate its strands before being used to prepare extension products. Preferably, the primer is an oligodeoxyribonucleotide. Preferably, the primer is sufficiently long to prime the synthesis of extension products in the presence of the inducing agent. The exact lengths of the primers will depend on many factors, including temperature, source of primer and the use of the method. The primers can be any useful length. Lengths of about 13 to about 35 nucleobases are preferred. One with ordinary skill in the art of molecular biology can design primers appropriate for amplification methods.

As used herein, a “pair of primers” or “a primer pair” is used for amplification of a nucleic acid sequence. A pair of primers comprises a forward primer and a reverse primer. The forward primer hybridizes to a sense strand of a target gene sequence to be amplified and primes synthesis of an antisense strand (complementary to the sense strand) using the target sequence as a template. A reverse primer hybridizes to the antisense strand of a target gene sequence to be amplified and primes synthesis of a sense strand (complementary to the antisense strand) using the target sequence as a template.

As used herein, the term “polymerase chain reaction” (“PCR”) refers to the method of K. B. Mullis U.S. Pat. Nos. 4,683,195, 4,683,202, and 4,965,188, hereby incorporated by reference, that describe a method for increasing the concentration of a segment of a target sequence in a mixture of genomic DNA without cloning or purification. This process for amplifying the target sequence consists of introducing a large excess of two oligonucleotide primers to the DNA mixture containing the desired target sequence, followed by a precise sequence of thermal cycling in the presence of a DNA polymerase. The two primers are complementary to their respective strands of the double stranded target sequence. To effect amplification, the mixture is denatured and the primers then annealed to their complementary sequences within the target molecule. Following annealing, the primers are extended with a polymerase so as to form a new pair of complementary strands. The steps of denaturation, primer annealing, and polymerase extension can be repeated many times (i.e., denaturation, annealing and extension constitute one “cycle”; there can be numerous “cycles”) to obtain a high concentration of an amplified segment of the desired target sequence. The length of the amplified segment of the desired target sequence is determined by the relative positions of the primers with respect to each other, and therefore, this length is a controllable parameter. By virtue of the repeating aspect of the process, the method is referred to as the “polymerase chain reaction” (hereinafter “PCR”). Because the desired amplified segments of the target sequence become the predominant sequences (in terms of concentration) in the mixture, they are said to be “PCR amplified.”
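
By way of illustration only, the effect of repeating the cycle can be sketched numerically. The sketch below assumes an idealized reaction in which every cycle multiplies the number of copies by (1 + efficiency); the function name and the efficiency parameter are hypothetical and are not part of the method described above.

```python
def copies_after_cycles(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Estimate the copy number after a number of PCR cycles.

    Assumption (not from the specification): each cycle multiplies the
    copy number by (1 + efficiency), where efficiency = 1.0 is perfect
    doubling and efficiency = 0.0 is no amplification.
    """
    if initial_copies < 0 or cycles < 0:
        raise ValueError("initial_copies and cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # A single starting copy under ideal doubling for 30 cycles.
    print(copies_after_cycles(1, 30))
```

Under this idealization a single copy yields 2^30 copies after 30 cycles; real reactions plateau well before such figures are sustained.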

With PCR, it is possible to amplify a single copy of a specific target sequence in genomic DNA to a level detectable by several different methodologies (e.g., hybridization with a labeled probe; incorporation of biotinylated primers followed by avidin-enzyme conjugate detection; incorporation of ³²P-labeled deoxynucleotide triphosphates, such as dCTP or dATP, into the amplified segment). In addition to genomic DNA, any oligonucleotide or polynucleotide sequence can be amplified with the appropriate set of primer molecules. In particular, the amplified segments created by the PCR process itself are, themselves, efficient templates for subsequent PCR amplifications.

As used herein, the terms “PCR product,” “PCR fragment,” and “amplification product” refer to the nucleic acid product obtained after two or more cycles of the PCR steps of denaturation, annealing and extension are complete. These terms encompass the case where there has been amplification of one or more segments of one or more target sequences.

As used herein, the term “amplification reagents” refers to those reagents (deoxyribonucleotide triphosphates, buffer, etc.), needed for amplification except for primers, nucleic acid template, and the amplification enzyme. Typically, amplification reagents along with other reaction components are placed and contained in a reaction vessel (test tube, microwell, etc.).

As used herein, the terms “complementary” or “complementarity” are used in reference to polynucleotides (i.e., a sequence of nucleotides such as an oligonucleotide or a target nucleic acid) related by the base-pairing rules. For example, the sequence “5′-A-G-T-3′” is complementary to the sequence “3′-T-C-A-5′.” Complementarity may be “partial,” in which only some of the nucleic acids' bases are matched according to the base pairing rules. Or, there may be “complete” or “total” complementarity between the nucleic acids. The degree of complementarity between nucleic acid strands has significant effects on the efficiency and strength of hybridization between nucleic acid strands. This is of particular importance in amplification reactions, as well as detection methods which depend upon binding between nucleic acids. Either term may also be used in reference to individual nucleotides, especially within the context of polynucleotides. For example, a particular nucleotide within an oligonucleotide may be noted for its complementarity, or lack thereof, to a nucleotide within another nucleic acid strand, in contrast or comparison to the complementarity between the rest of the oligonucleotide and the nucleic acid strand.
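
The base-pairing rules referred to above can be illustrated with a short sketch for the four standard DNA bases. The function names are hypothetical, and the measure of partial complementarity shown (fraction of paired positions between two equal-length strands in antiparallel alignment) is only one possible convention.

```python
# Watson-Crick pairing for the four standard DNA bases.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_3to5(seq_5to3: str) -> str:
    """Return the complementary strand written in the 3' to 5' direction."""
    return "".join(PAIRS[base] for base in seq_5to3.upper())


def reverse_complement(seq_5to3: str) -> str:
    """Return the complementary strand written in the 5' to 3' direction."""
    return complement_3to5(seq_5to3)[::-1]


def fraction_complementary(a_5to3: str, b_5to3: str) -> float:
    """Fraction of positions that pair when two equal-length strands,
    both written 5' to 3', are aligned antiparallel."""
    if len(a_5to3) != len(b_5to3) or not a_5to3:
        raise ValueError("strands must be non-empty and of equal length")
    b_3to5 = b_5to3.upper()[::-1]
    paired = sum(1 for x, y in zip(a_5to3.upper(), b_3to5) if PAIRS[x] == y)
    return paired / len(a_5to3)
```

For the example in the text, `complement_3to5("AGT")` returns `"TCA"`, that is, the strand 3′-T-C-A-5′.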

The terms “homology,” “homologous” and “sequence identity” refer to a degree of identity. There may be partial homology or complete homology. A partially homologous sequence is one that is less than 100% identical to another sequence. Determination of sequence identity is described in the following example: a primer 20 nucleobases in length which is otherwise identical to another 20 nucleobase primer but having two non-identical residues has 18 of 20 identical residues (18/20=0.9 or 90% sequence identity). In another example, a primer 15 nucleobases in length having all residues identical to a 15 nucleobase segment of a primer 20 nucleobases in length would have 15/20=0.75 or 75% sequence identity with the 20 nucleobase primer. In the context of the present invention, sequence identity is meant to be properly determined when the query sequence and the subject sequence are both described in the 5′ to 3′ direction.
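
The two worked examples above can be reproduced with the following sketch. It assumes an ungapped comparison in which the shorter sequence is slid along the longer one and the best count of identical residues is divided by the length of the longer sequence; the function name and the ungapped-alignment assumption are illustrative only.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percent identity of two sequences, both written 5' to 3'.

    Assumption: ungapped comparison; the shorter sequence is slid along
    the longer one, and the best match count is divided by the length
    of the longer sequence.
    """
    short, long_ = sorted((query.upper(), subject.upper()), key=len)
    if not long_:
        raise ValueError("sequences must not both be empty")
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        matches = sum(1 for x, y in zip(short, window) if x == y)
        best = max(best, matches)
    return 100.0 * best / len(long_)
```

A 20 nucleobase primer differing from another at two positions scores 90%, and a 15 nucleobase primer fully contained in a 20 nucleobase primer scores 75%, consistent with the examples given.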

As used herein, the term “hybridization” is used in reference to the pairing of complementary nucleic acids. Hybridization and the strength of hybridization (i.e., the strength of the association between the nucleic acids) are influenced by such factors as the degree of complementarity between the nucleic acids, stringency of the conditions involved, and the T_(m) of the formed hybrid. “Hybridization” methods involve the annealing of one nucleic acid to another, complementary nucleic acid, i.e., a nucleic acid having a complementary nucleotide sequence. The ability of two polymers of nucleic acid containing complementary sequences to find each other and anneal through base pairing interaction is a well-recognized phenomenon. The initial observations of the “hybridization” process by Marmur and Lane, Proc. Natl. Acad. Sci. USA 46:453 (1960) and Doty et al., Proc. Natl. Acad. Sci. USA 46:461 (1960) have been followed by the refinement of this process into an essential tool of modern biology.

The complement of a nucleic acid sequence as used herein refers to an oligonucleotide which, when aligned with the nucleic acid sequence such that the 5′ end of one sequence is paired with the 3′ end of the other, is in “antiparallel association.” Certain bases not commonly found in natural nucleic acids may be included in the nucleic acids of the present invention and include, for example, inosine and 7-deazaguanine. Complementarity need not be perfect; stable duplexes may contain mismatched base pairs or unmatched bases. Those skilled in the art of nucleic acid technology can determine duplex stability empirically considering a number of variables including, for example, the length of the oligonucleotide, base composition and sequence of the oligonucleotide, ionic strength and incidence of mismatched base pairs.

As used herein, the term “T_(m)” is used in reference to the “melting temperature.” The melting temperature is the temperature at which a population of double-stranded nucleic acid molecules becomes half dissociated into single strands. Several equations for calculating the T_(m) of nucleic acids are well known in the art. As indicated by standard references, a simple estimate of the T_(m) value may be calculated by the equation: T_(m)=81.5+0.41(% G+C), when a nucleic acid is in aqueous solution at 1 M NaCl (see e.g., Anderson and Young, Quantitative Filter Hybridization, in Nucleic Acid Hybridization (1985)). Other references (e.g., Allawi, H. T. & SantaLucia, J., Jr. Thermodynamics and NMR of internal G.T mismatches in DNA. Biochemistry 36, 10581-94 (1997)) include more sophisticated computations which take structural and environmental, as well as sequence, characteristics into account for the calculation of T_(m).
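
The simple estimate stated above can be evaluated as in the following sketch, which implements only the equation T_(m)=81.5+0.41(% G+C) as written. The function name is hypothetical, and no length, salt or mismatch corrections are applied.

```python
def simple_tm(seq: str) -> float:
    """Estimate T_m in degrees C as 81.5 + 0.41 * (% G+C).

    Implements only the simple equation quoted in the text (aqueous
    solution at 1 M NaCl); length and mismatch effects are ignored.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = seq.count("G") + seq.count("C")
    percent_gc = 100.0 * gc_count / len(seq)
    return 81.5 + 0.41 * percent_gc
```

For a sequence with 50% G+C content the estimate is 81.5 + 0.41 × 50 = 102.0; the more sophisticated nearest-neighbor computations cited above would be expected to differ, particularly for short oligonucleotides.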

The term “gene” refers to a DNA sequence that comprises control and coding sequences necessary for the production of an RNA having a non-coding function (e.g., a ribosomal or transfer RNA), a polypeptide or a precursor. The RNA or polypeptide can be encoded by a full length coding sequence or by any portion of the coding sequence so long as the desired activity or function is retained.

The term “wild-type” refers to a gene or a gene product that has the characteristics of that gene or gene product when isolated from a naturally occurring source. A wild-type gene is that which is most frequently observed in a population and is thus arbitrarily designated the “normal” or “wild-type” form of the gene. In contrast, the term “modified”, “mutant” or “polymorphic” refers to a gene or gene product which displays modifications in sequence and/or functional properties (i.e., altered characteristics) when compared to the wild-type gene or gene product. It is noted that naturally-occurring mutants can be isolated; these are identified by the fact that they have altered characteristics when compared to the wild-type gene or gene product.

The term “oligonucleotide” as used herein is defined as a molecule comprising two or more deoxyribonucleotides or ribonucleotides, preferably at least 5 nucleotides, more preferably at least about 13 to 35 nucleotides. The exact size will depend on many factors, which in turn depend on the ultimate function or use of the oligonucleotide. The oligonucleotide may be generated in any manner, including chemical synthesis, DNA replication, reverse transcription, PCR, or a combination thereof.

Because mononucleotides are reacted to make oligonucleotides in a manner such that the 5′ phosphate of one mononucleotide pentose ring is attached to the 3′ oxygen of its neighbor in one direction via a phosphodiester linkage, an end of an oligonucleotide is referred to as the “5′-end” if its 5′ phosphate is not linked to the 3′ oxygen of a mononucleotide pentose ring and as the “3′-end” if its 3′ oxygen is not linked to a 5′ phosphate of a subsequent mononucleotide pentose ring. As used herein, a nucleic acid sequence, even if internal to a larger oligonucleotide, also may be said to have 5′ and 3′ ends. A first region along a nucleic acid strand is said to be upstream of another region if the 3′ end of the first region is before the 5′ end of the second region when moving along a strand of nucleic acid in a 5′ to 3′ direction. All oligonucleotide primers disclosed herein are understood to be presented in the 5′ to 3′ direction when reading left to right.

When two different, non-overlapping oligonucleotides anneal to different regions of the same linear complementary nucleic acid sequence, and the 3′ end of one oligonucleotide points towards the 5′ end of the other, the former may be called the “upstream” oligonucleotide and the latter the “downstream” oligonucleotide. Similarly, when two overlapping oligonucleotides are hybridized to the same linear complementary nucleic acid sequence, with the first oligonucleotide positioned such that its 5′ end is upstream of the 5′ end of the second oligonucleotide, and the 3′ end of the first oligonucleotide is upstream of the 3′ end of the second oligonucleotide, the first oligonucleotide may be called the “upstream” oligonucleotide and the second oligonucleotide may be called the “downstream” oligonucleotide.

The term “primer” refers to an oligonucleotide that is capable of acting as a point of initiation of synthesis when placed under conditions in which primer extension is initiated. An oligonucleotide “primer” may occur naturally, as in a purified restriction digest or may be produced synthetically. A primer is selected to be “substantially” complementary to a strand of specific sequence of the template. A primer must be sufficiently complementary to hybridize with a template strand for primer elongation to occur. A primer sequence need not reflect the exact sequence of the template. For example, a non-complementary nucleotide fragment may be attached to the 5′ end of the primer, with the remainder of the primer sequence being substantially complementary to the strand. Non-complementary bases or longer sequences can be interspersed into the primer, provided that the primer sequence has sufficient complementarity with the sequence of the template to hybridize and thereby form a template primer complex for synthesis of the extension product of the primer.

The term “target nucleic acid” refers to a nucleic acid molecule containing a sequence that has at least partial complementarity with an oligonucleotide primer. The target nucleic acid may comprise single- or double-stranded DNA or RNA.

The term “variable sequence” as used herein refers to differences in nucleic acid sequence between two nucleic acids. For example, the same gene of two different bacterial species may vary in sequence by the presence of single base substitutions and/or deletions or insertions of one or more nucleotides. These two forms of the structural gene are said to vary in sequence from one another.

The term “nucleotide analog” as used herein refers to modified or non-naturally occurring nucleotides such as 5-propynyl pyrimidines (i.e., 5-propynyl-dTTP and 5-propynyl-dCTP) and 7-deaza purines (i.e., 7-deaza-dATP and 7-deaza-dGTP). Nucleotide analogs include base analogs and comprise modified forms of deoxyribonucleotides as well as ribonucleotides.

The term “microorganism” as used herein means an organism too small to be observed with the unaided eye and includes, but is not limited to, bacteria, viruses, protozoans, fungi, and ciliates.

The term “microbial gene sequences” refers to gene sequences derived from a microorganism.

The term “bacteria” or “bacterium” refers to any member of the groups of eubacteria and archaebacteria.

The term “virus” refers to obligate, ultramicroscopic, intracellular parasites incapable of autonomous replication (i.e., replication requires the use of the host cell's machinery).

The term “sample” in the present specification and claims is used in its broadest sense. On the one hand it is meant to include a specimen or culture (e.g., microbiological cultures). On the other hand, it is meant to include both biological and environmental samples. A sample may include a specimen of synthetic origin.

Biological samples may be animal, including human, fluid, solid (e.g., stool) or tissue, as well as liquid and solid food and feed products and ingredients such as dairy items, vegetables, meat and meat by-products, and waste. Biological samples may be obtained from all of the various families of domestic animals, as well as feral or wild animals, including, but not limited to, such animals as ungulates, bear, fish, lagomorphs, rodents, etc.

Environmental samples include environmental material such as surface matter, soil, water and industrial samples, as well as samples obtained from food and dairy processing instruments, apparatus, equipment, utensils, disposable and non-disposable items. These examples are not to be construed as limiting the sample types applicable to the present invention.

The term “source of target nucleic acid” refers to any sample that contains nucleic acids (RNA or DNA). Particularly preferred sources of target nucleic acids are biological samples including, but not limited to blood, saliva, cerebral spinal fluid, pleural fluid, milk, lymph, sputum and semen. The source of nucleic acid may also be an organism such as a human, animal, bacterium, virus or fungus for example.

The term “polymerization means” or “polymerization agent” refers to any agent capable of facilitating the addition of nucleoside triphosphates to an oligonucleotide. Preferred polymerization means comprise DNA and RNA polymerases.

The term “adduct” is used herein in its broadest sense to indicate any compound or element that can be added to an oligonucleotide. An adduct may be charged (positively or negatively) or may be charge-neutral. An adduct may be added to the oligonucleotide via covalent or non-covalent linkages. Examples of adducts include, but are not limited to, indodicarbocyanine dye amidites, amino-substituted nucleotides, ethidium bromide, ethidium homodimer, (1,3-propanediamino)propidium, (diethylenetriamino)propidium, thiazole orange, (N-N′-tetramethyl-1,3-propanediamino)propyl thiazole orange, (N-N′-tetramethyl-1,2-ethanediamino)propyl thiazole orange, thiazole orange-thiazole orange homodimer (TOTO), thiazole orange-thiazole blue heterodimer (TOTAB), thiazole orange-ethidium heterodimer 1 (TOED1), thiazole orange-ethidium heterodimer 2 (TOED2) and fluorescein-ethidium heterodimer (FED), psoralens, biotin, streptavidin, avidin, etc.

Where a first oligonucleotide is complementary to a region of a target nucleic acid and a second oligonucleotide has complementarity to the same region (or a portion of this region), a “region of overlap” exists along the target nucleic acid. The degree of overlap will vary depending upon the nature of the complementarity.

As used herein, the term “purified” or “to purify” refers to the removal of contaminants from a sample.

As used herein the term “portion” when in reference to a protein (as in “a portion of a given protein”) refers to fragments of that protein. The fragments may range in size from four amino acid residues to the entire amino acid sequence minus one amino acid (e.g., 4, 5, 6, . . . , n−1).

The term “nucleic acid” or “nucleic acid sequence” as used herein refers to an oligonucleotide, nucleotide or polynucleotide, and fragments or portions thereof, and to DNA or RNA of genomic or synthetic origin which may be single or double stranded, and represent the sense or antisense strand. Similarly, “amino acid sequence” as used herein refers to peptide or protein sequence.

The term “peptide nucleic acid” (“PNA”) as used herein refers to a molecule comprising bases or base analogs such as would be found in natural nucleic acid, but attached to a peptide backbone rather than the sugar-phosphate backbone typical of nucleic acids. The attachment of the bases to the peptide is such as to allow the bases to base pair with complementary bases of nucleic acid in a manner similar to that of an oligonucleotide. These small molecules, also designated antigene agents, stop transcript elongation by binding to their complementary strand of nucleic acid (Nielsen, et al. Anticancer Drug Des. 8:53-63 [1993]).

The term “locked nucleic acid” (“LNA”) as used herein refers to a conformationally restricted nucleic acid analogue, in which the ribose ring is locked into a rigid C3′-endo (or Northern-type) conformation by a simple 2′-O, 4′-C methylene bridge. Duplexes involving LNA (hybridized to either DNA or RNA) display a large increase in melting temperatures of between +3.0 and +9.3° C. per LNA modification, in comparison to corresponding unmodified reference duplexes. LNA recognizes both DNA and RNA with remarkable affinities and selectivities. Incorporation of a given number of LNA monomers into oligonucleotides is a very convenient way of vastly improving the stability and specificity of duplexes toward complementary RNA or DNA such as, for example, primer binding regions.

As used herein, the terms “purified” or “substantially purified” refer to molecules, either nucleic or amino acid sequences, that are removed from their natural environment, isolated or separated, and are at least 60% free, preferably 75% free, and most preferably 90% free from other components with which they are naturally associated. An “isolated polynucleotide” or “isolated oligonucleotide” is therefore a substantially purified polynucleotide.

The term “duplex” refers to the state of nucleic acids in which the base portions of the nucleotides on one strand are bound through hydrogen bonding to their complementary bases arrayed on a second strand. The condition of being in a duplex form reflects on the state of the bases of a nucleic acid. By virtue of base pairing, the strands of nucleic acid also generally assume the tertiary structure of a double helix, having a major and a minor groove. The assumption of the helical form is implicit in the act of becoming duplexed.

The term “template” refers to a strand of nucleic acid on which a complementary copy is built from nucleoside triphosphates through the activity of a template-dependent nucleic acid polymerase. Within a duplex the template strand is, by convention, depicted and described as the “bottom” strand. Similarly, the non-template strand is often depicted and described as the “top” strand.

The term “template-dependent RNA polymerase” refers to a nucleic acid polymerase that creates new RNA strands through the copying of a template strand as described above and which does not synthesize RNA in the absence of a template. This is in contrast to the activity of the template-independent nucleic acid polymerases that synthesize or extend nucleic acids without reference to a template, such as terminal deoxynucleotidyl transferase, or Poly A polymerase.

The term “in silico” when used in relation to a process indicates that the process is simulated on or embedded in a computer.

The term “priming region” refers to a region on a target nucleic acid sequence to which a primer hybridizes for the purpose of extension of the complementary strand of the target nucleic acid sequence.

The term “non-templated T residue” as used herein refers to a thymidine (T) residue added to the 5′ end of a primer which does not necessarily hybridize to the target nucleic acid being amplified.

The term “genotype” as used herein refers to at least a portion of the genetic makeup of an individual. A portion of a genome can be sufficient for assignment of a genotype to an individual provided that the portion of the genome contains a representative sequence or base composition to distinguish the genotype from other genotypes.

The term “nucleobase” as used herein is synonymous with other terms in use in the art including “nucleotide,” “deoxynucleotide,” “nucleotide residue,” “deoxynucleotide residue,” “nucleotide triphosphate (NTP),” or “deoxynucleotide triphosphate (dNTP).”

As defined herein, “base composition” refers to the numbers of each of the four standard nucleobases that are present within a given standard sequence or corresponding amplification product of a standard, test or variant sequence. Methods including steps of measuring base compositions are disclosed and claimed in commonly owned published U.S. Patent Application Nos: 20030124556, 20030082539, 20040209260, 20040219517, and 20040180328 and U.S. Ser. Nos. 10/728,486, 10/829,826, 10/660,998, 10/853,660, 60/604,329, 60/632,862, 60/639,068, 60/648,188, Ser. Nos. 11/060,135, 11/073,362, and 60/658,248, each of which is incorporated herein by reference in entirety.

As used herein, the term “base composition analysis” refers to determination of the base composition of an amplification product representing a sub-segment of a target nucleic acid sequence from the molecular mass of the amplification product determined by mass spectrometry. In embodiments of the present invention, base composition analysis may include determination of base compositions of two or more amplification products representing overlapping sub-segments of a nucleic acid sequence which are to be compared with the defined base compositions of the corresponding overlapping sub-segments of one or more reference nucleic acids.

As used herein, the term “reference nucleic acid” or “reference nucleic acid segment” is a characterized nucleic acid of known sequence and/or known base composition. A reference nucleic acid segment is compared with uncharacterized sequences in various embodiments of the present invention. For example, a characterized vector or portion thereof can be used as a reference nucleic acid segment. A characterized portion of human nucleic acid may also be used as a reference nucleic acid provided the genotype, identity or race of the human from which the reference nucleic acid is obtained is known. A genome or a portion thereof of a bacterium, virus or fungus may also be employed as a reference nucleic acid provided that the species or genotype of the bacterium, virus or fungus is known.

As used herein, the term “reference base composition” refers to a characterized base composition. For example, a sub-segment of a reference nucleic acid having the defined sequence AAAAATTTTCCCGG has a standard base composition of A₅ T₄ C₃ G₂.
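
For a sub-segment of known sequence, the base composition can be tallied directly, as in the following sketch. The function names and the output notation (a plain-text rendering of A₅ T₄ C₃ G₂) are illustrative assumptions.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Count each of the four standard nucleobases in a sequence."""
    counts = Counter(seq.upper())
    unexpected = set(counts) - set("ATCG")
    if unexpected:
        raise ValueError(f"non-standard symbols: {sorted(unexpected)}")
    return {base: counts.get(base, 0) for base in "ATCG"}


def format_composition(comp: dict) -> str:
    """Render a composition in the order A, T, C, G, e.g. 'A5 T4 C3 G2'."""
    return " ".join(f"{base}{comp[base]}" for base in "ATCG")


if __name__ == "__main__":
    # The defined sequence used as the example in the text.
    print(format_composition(base_composition("AAAAATTTTCCCGG")))
```

Note that a base composition records only the numbers of each nucleobase; distinct sequences with the same counts share the same base composition.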

As used herein, the term “test nucleic acid sequence” refers to an uncharacterized nucleic acid sequence whose base composition is to be characterized and compared with one or more standard nucleic acid segments.

As used herein, the term “overlap” or “overlapping sub-segments” refers to sub-segments of a standard nucleic acid segment which have overlap as illustrated by the following example which employs a standard nucleic acid segment of length of 300 nucleobases. A first sub-segment may, for example, extend from position 1 to position 100. A second sub-segment may, for example, extend from position 60 to position 160, having overlap from position 60 to position 100. A third sub-segment may, for example, extend from position 120 to position 220, having overlap from position 120 to position 160. A fourth sub-segment may, for example, extend from position 180 to position 280, having overlap from position 180 to position 220. Producing sub-segments with overlap is useful because it provides redundancy and reduces the likelihood that sub-segments containing variants relative to a given standard sub-segment will be mischaracterized. If a primer used to amplify a given sub-segment hybridizes to a position with a mutation relative to the reference sequence, the amplification product will not contain the mutation because the primer extension product, which carries the primer sequence rather than the mutated sequence, is used as a template in subsequent amplification cycles. Thus, having overlap of two sub-segments wherein overlap of the second sub-segment over the first sub-segment extends past the reverse primer hybridization site of the first sub-segment eliminates the possibility that the reverse primer for the first sub-segment will mask a given mutation within the first sub-segment reverse primer hybridization site. The extent of minimal overlap should be determined by the length of the primer hybridization site of a given sub-segment. Generally, overlap of sub-segments by several nucleobases is appropriate but shorter overlap lengths may also be appropriate provided the primer hybridization sites are shorter. The avoidance of overlap of primer hybridization sites on overlapping sub-segments is preferred.
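
The positional example above can be checked with the following sketch, which computes the overlap between adjacent sub-segments and tests whether the reverse primer hybridization site of one sub-segment falls within the primer-free interior of the next. The function names, the use of a single primer length for every primer, and the exact form of the test are assumptions made for illustration only.

```python
def adjacent_overlaps(segments):
    """Overlap (start, end) between each pair of adjacent sub-segments.

    segments: list of (start, end) tuples, 1-based inclusive, sorted by
    start. Returns None for a pair that does not overlap.
    """
    overlaps = []
    for (s1, e1), (s2, e2) in zip(segments, segments[1:]):
        start, end = max(s1, s2), min(e1, e2)
        overlaps.append((start, end) if start <= end else None)
    return overlaps


def reverse_site_is_read(first, second, primer_len):
    """True if the reverse primer site of `first` (its last primer_len
    positions) lies inside `second` between, and not overlapping, the
    primer sites of `second`. Assumes all primers have length primer_len.
    """
    (_, e1), (s2, e2) = first, second
    site_start = e1 - primer_len + 1   # first position of the reverse primer site
    interior_start = s2 + primer_len   # first position after second's forward primer
    interior_end = e2 - primer_len     # last position before second's reverse primer
    return interior_start <= site_start and e1 <= interior_end


# The four sub-segments of the 300 nucleobase example in the text.
EXAMPLE_SEGMENTS = [(1, 100), (60, 160), (120, 220), (180, 280)]
```

With the example sub-segments, the computed overlaps are positions 60 to 100, 120 to 160 and 180 to 220, as stated. Under the assumed primer length of 20 nucleobases, the reverse primer site of the first sub-segment (positions 81 to 100) lies within the interior of the second sub-segment, whereas a primer length of 25 would cause the primer sites to overlap.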

As used herein, the term “co-amplification” or “co-amplified” refers to the process of obtaining more than one amplification product in the same amplification reaction mixture using the same pair of primers.

As used herein, the term “vector” refers to a nucleic acid adapted for transfection into a host cell. Examples of vectors include, but are not limited to, plasmids, cosmids, bacteriophages and the like.

As used herein, the term “therapeutic protein” refers to any protein product produced by biotechnological methods for use as a therapeutic product. Examples of therapeutic proteins include, but are not limited to protein products such as vaccines, antibodies, structural proteins, hormones, and cell signaling proteins such as receptors, cytokines and the like.

As used herein, the term “recombinant” refers to having been created by genetic engineering. For example, a “recombinant insert” refers to a nucleic acid segment inserted into another nucleic acid sequence using techniques well known to those with ordinary skill in the arts of genetic engineering and molecular biology.

A “nucleic acid variant” is herein defined as a nucleic acid having substantial similarity or sequence identity with a “standard” nucleic acid sequence, for example, from about 70% up to but not including 100% sequence identity.

As used herein, a “triplex combination of primer pairs” refers to three primer pairs which are to be included in an amplification mixture for the purpose of obtaining three distinct amplification products from a given target nucleic acid.

DESCRIPTION OF EMBODIMENTS

Provided herein are compositions and methods for determining the presence of a nucleic acid variant or a genotype relative to a known and defined “reference” nucleic acid sequence. Identification of a distinct genotype in certain embodiments is satisfied by identification of a distinct base composition of a given sub-segment of a target nucleic acid.

In the methods described herein where the genotype, and in turn the identity, of a nucleic acid sample is determined, the nucleic acid is measured to deliver a base composition profile. That measured base composition profile is then compared to a reference base composition profile that is further associated with an identity. The reference base composition can be a head-to-head comparison or a standard reference database. In both the head-to-head comparison and the standard reference database comparison, the unknown sample is analyzed using the disclosed compositions and methods to generate a measured base composition profile. For the head-to-head comparison, the reference base composition profile is generated by similarly analyzing samples from a selected suspect population using the disclosed compositions and methods. The measured base composition is then compared to the reference base compositions and if a match occurs between the unknown and a suspect, then the identity is determined. In the standard reference database comparison the measured base composition is compared to a pre-existing database of reference base compositions. This database can be populated using standard reference nucleic acids, previously measured base composition and converted data to generate base compositions. For example, but not limitation, a standard reference nucleic acid can include commercially available vectors like pUC, the certified values for CODIS 13 loci (SRM 2391b available from the National Institute of Standards and Technology) and the Anderson mitochondrial DNA sequence. Converted data can include, but is not limited to, previously obtained sequence data, such as the reference data that is stored in the SWGDAM database that is bioinformatically converted to base composition data.
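
The comparison of a measured base composition profile with a database of reference profiles can be sketched as a simple lookup. In the sketch below, a profile is represented as a mapping from a sub-segment label to a tuple of (A, G, C, T) counts; this data structure, the function name, and all labels and counts shown are hypothetical toy values and are not drawn from any actual reference database.

```python
def matching_identities(measured_profile, reference_db):
    """Return the identities whose reference profile equals the measured
    profile for every sub-segment.

    measured_profile: dict mapping sub-segment label -> (A, G, C, T) counts
    reference_db: dict mapping identity -> profile in the same form
    """
    return sorted(
        identity
        for identity, profile in reference_db.items()
        if profile == measured_profile
    )


# Hypothetical toy database with two reference profiles.
TOY_REFERENCE_DB = {
    "reference 1": {"seg1": (25, 24, 26, 25), "seg2": (30, 20, 22, 28)},
    "reference 2": {"seg1": (25, 24, 26, 25), "seg2": (29, 21, 22, 28)},
}

# Hypothetical measured profile of an unknown sample.
TOY_MEASURED = {"seg1": (25, 24, 26, 25), "seg2": (29, 21, 22, 28)}
```

In this toy case the two references share the base composition of the first sub-segment and are distinguished only by the second, which illustrates why a profile spanning several sub-segments can resolve identities that a single sub-segment cannot.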

Also provided herein are compositions and methods for identifying a human by comparison of base compositions of amplification products representing overlapping sub-segments of a target nucleic acid with base compositions of reference sub-segments of one or more reference nucleic acids.

Amplification products of portions of the target nucleic acid which correspond to the sub-segments are produced and their molecular masses are measured by mass spectrometry. Base compositions of the amplification products are calculated from their molecular masses and the base compositions are compared with the base compositions of the corresponding sub-segments of the reference nucleic acid. A given target region can have any length depending upon the type of analysis to be conducted and in recognition of the numbers of primer pairs required to obtain amplification products representing overlapping sub-segments of the target. If a bacterium with a large genome is to be analyzed, and the target is the entire genome, a target nucleic acid may have a length of several kilobases. Alternatively, a target region may be about 300 to about 1000 nucleobases in length.
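
One way in which a base composition might be calculated from a molecular mass is sketched below, for illustration only. The sketch assumes a single DNA strand of known length bearing 5′ and 3′ hydroxyl termini, uses approximate average residue masses that are commonly quoted for oligonucleotides (these values, the end correction, the tolerance and the function names are all assumptions not taken from this specification), and enumerates every composition whose calculated mass falls within the tolerance of the measured mass.

```python
# Approximate average masses (Da) of deoxynucleotide monophosphate
# residues within a DNA strand. Assumed values, not from the specification.
RESIDUE_MASS = {"A": 313.21, "T": 304.20, "C": 289.18, "G": 329.21}
# Approximate correction for a strand with 5'-OH and 3'-OH termini (assumed).
END_CORRECTION = -61.96


def strand_mass(comp: dict) -> float:
    """Approximate average mass of a single strand with the given composition."""
    return sum(RESIDUE_MASS[b] * n for b, n in comp.items()) + END_CORRECTION


def candidate_compositions(measured_mass: float, length: int, tolerance: float):
    """List every composition of the given length whose calculated mass
    lies within `tolerance` (Da) of the measured mass."""
    hits = []
    for a in range(length + 1):
        for t in range(length + 1 - a):
            for c in range(length + 1 - a - t):
                g = length - a - t - c
                comp = {"A": a, "T": t, "C": c, "G": g}
                if abs(strand_mass(comp) - measured_mass) <= tolerance:
                    hits.append(comp)
    return hits
```

More than one composition may fall within a given tolerance, so the result is a list of candidates rather than a single answer; narrowing the candidates further, for example by also using the mass of the complementary strand, is outside the scope of this sketch.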

In some embodiments, the nucleic acid variant has a sequence identical to the standard sequence with the exception of having one or more single nucleotide polymorphisms, insertions or deletions.

In some embodiments, the reference nucleic acid and variant nucleic acid are either single stranded or double stranded DNA or RNA. In some embodiments, the standard and variant nucleic acids originate from the genome of a bacterium or a virus or are synthesized nucleic acids such as PCR products, for example.

A set of sub-segments within the reference nucleic acid sequence is defined. In some embodiments, the members of the set of standard sub-segments are from about 45 to about 150 nucleobases in length. One will recognize that this includes standard sub-segments of lengths of 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, or 150 nucleobases in length.

In some embodiments, the molecular masses of the test amplification products are determined by mass spectrometry such as electrospray Fourier transform ion cyclotron resonance (FTICR) mass spectrometry or electrospray time-of-flight mass spectrometry. The use of electrospray mass spectrometry permits the measurement of large amplification products, as large as 500 nucleobases in length, whereas amplification products analyzed by matrix-assisted laser desorption ionization mass spectrometry are typically much smaller in length (approximately 15 nucleobases in length).

If desired, the length of the standard segments can be chosen such that some members of the set have calculated molecular masses that are dissimilar from other members of the set. Having standard segments of dissimilar molecular masses allows for multiplexing or pooling of amplification products corresponding to the standard segments prior to molecular mass determination, by mass spectrometry for example. As is illustrated in FIGS. 2 and 3, the resultant amplification products from a reaction using the at least two primer pairs are sufficiently separated along the charge axis of the mass spectrometry plot. This separation is preferred, but not necessary, because the individually measured amplicon strands can be easily visualized.
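
By way of non-limiting illustration, a check that the strands of pooled amplification products are sufficiently dissimilar in molecular mass can be sketched in Python as follows. The function name `masses_resolvable`, the example masses and the separation threshold are hypothetical; the disclosure does not specify a numerical threshold.

```python
from itertools import combinations
from typing import Sequence


def masses_resolvable(strand_masses: Sequence[float], min_separation: float) -> bool:
    """Return True if every pair of strand masses differs by at least
    min_separation Da."""
    return all(abs(m1 - m2) >= min_separation
               for m1, m2 in combinations(strand_masses, 2))


# Hypothetical calculated strand masses (Da) for a pooled reaction.
pool = [30512.4, 31190.8, 36744.1, 37420.3, 42105.9, 42876.2]
print(masses_resolvable(pool, 100.0))
```

A check of this kind could be applied to calculated masses of the standard segments when deciding which primer pairs to combine in one reaction.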

In some embodiments, the compositions and methods are used for genotyping of a suspected variant of a known species of bacterium or virus. The base compositions of the test amplification products, if different from the base composition of the standard segments, provide the means for identification of a previously known variant, or for characterization of a previously unobserved variant.

In some embodiments, the compositions and methods are used for identification and characterization of genetically engineered bacteria or viruses. Genetically engineered organisms are produced by insertion or deletion of genes. These modifications are readily detectable by the methods of the present invention.

In some embodiments, the compositions and methods can be used for validation of reference nucleic acid sequences such as those encoding therapeutic proteins including but not limited to vaccines and biological drugs such as monoclonal antibodies for example. A nucleic acid is “validated” by base composition analysis according to the method of the present invention, wherein the result indicates that the analyzed nucleic acid and/or sub-segments thereof have the same base compositions as the reference nucleic acid. The process of “validation” confirms that polymorphisms have not been introduced into the target sequence relative to the reference sequence.

In some embodiments, a known quantity of the standard sequence is included in the sample (as an internal calibration standard) containing the suspected variant and the quantity of the variant is determined from the abundance data obtained from mass spectrometry for example. Methods of using internal calibration standards in base composition analyses are described in commonly owned U.S. application Ser. No. 11/059,776 which is incorporated herein by reference in entirety.

In some embodiments, the compositions and methods are used for characterization of heterogeneity of a standard nucleic acid test sample. For example, the standard nucleic acid test sample can be a vaccine vector having a standard sequence. The present invention can be used to identify a variant of said standard sequence and also determine the quantity of the variant relative to the standard sequence. Such an analysis is advantageous, for example, in situations requiring rapid throughput analysis for quality control. The methods described herein will be able to determine if the quantity of a variant sub-population increases to the point wherein quality of the product is compromised.
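
By way of non-limiting illustration, the determination of the quantity of a variant from mass spectral abundance data can be sketched in Python as follows. The sketch assumes that the measured abundance of each amplification product is proportional to the quantity of its template, which is a simplifying assumption of this illustration; the function names and the numerical values are hypothetical.

```python
def percent_minor(major_abundance: float, minor_abundance: float) -> float:
    """Percentage of the minor product relative to the sum of both products."""
    total = major_abundance + minor_abundance
    if total <= 0:
        raise ValueError("total abundance must be positive")
    return 100.0 * minor_abundance / total


def quantity_from_calibrant(variant_abundance: float,
                            calibrant_abundance: float,
                            calibrant_quantity: float) -> float:
    """Estimate the variant quantity from a known quantity of an internal
    calibration standard, assuming signal proportional to quantity."""
    if calibrant_abundance <= 0:
        raise ValueError("calibrant abundance must be positive")
    return calibrant_quantity * variant_abundance / calibrant_abundance


# Hypothetical abundances in arbitrary units.
print(percent_minor(75.0, 25.0))
print(quantity_from_calibrant(200.0, 100.0, 50.0))
```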

In some embodiments, the compositions and methods are used for identification of a genotype of a given organism. This can be accomplished by first selecting a series of primer pairs for amplification of consecutive or overlapping segments of a standard nucleic acid region found across known genotypes of a given organism. The process continues by amplifying a test nucleic acid of an organism of unknown genotype with the series of primer pairs to obtain a corresponding series of amplification products, at least some of which are then measured by mass spectrometry. Base compositions of the amplification products are then calculated from the molecular masses. These base compositions are compared with measured or calculated amplification product base compositions representing amplification products of known genotypes of a given organism obtained with the same series of primers. One or more matches of known and unknown base compositions provide the genotype of the organism.

Preferably, at least some or all of the amplification products have lengths from about 45 to about 150 nucleobases. However, and depending on the mass spectrometer instrument used, the amplification products analyzed by mass spectrometry can be as large as about 500 nucleobases. Moreover, very large amplification products can be digested into smaller fragments that are compatible with the mass spectrometer used. Methods of base composition analysis are described in commonly owned U.S. patent application Ser. Nos. 10/660,998, 10/853,660, and 11/209,439, each of which is incorporated herein by reference in entirety.

In some embodiments, the amplification is effected using the polymerase chain reaction (PCR). In some embodiments, the PCR reaction is performed with an extension cycle having a length of one second. The one second extension cycle is shorter than an ordinary extension cycle and is employed for the purpose of minimization of artifact amplification products arising from target site crossover.

In some embodiments, the organism of unknown genotype is a human individual. In some embodiments, obtaining a genotypic result for a human individual provides the means to draw a forensic conclusion with regard to the individual, for example, to conclude with a very high probability that the individual has had contact with another individual or was present at a particular location.

In some embodiments with applications in human forensics, a given forensic nucleic acid sample may be characterized by base composition analysis that includes comparison with members of a database of tens, hundreds or even thousands of reference nucleic acid segments obtained from individuals of known identity or racial profile, or with standard references like the Anderson mitochondrial DNA sequence. Such a database can be stored on or embedded in a computer-readable medium and accessed over a network such as the internet for example. Preferably the database comprises base compositions of individual sub-segments of the reference nucleic acids.

In some embodiments, the nucleic acid being amplified for a genotyping analysis is mitochondrial DNA. In other embodiments, the nucleic acid is chromosomal DNA.

In some embodiments, the mitochondrial DNA being amplified for a genotyping analysis is from one or both of the highly variable regions HV1 or HV2.

In some embodiments, the DNA region being analyzed is 300 to 700 nucleobases in length. In other embodiments, the DNA region being analyzed is 400 to 600 nucleobases in length or any length therewithin.

In some embodiments, the amplifying step of the method is carried out in the presence of a dNTP containing a molecular mass-modifying tag. In some embodiments, only one of the four canonical dNTPs has the molecular mass-modifying tag. In some embodiments, the dNTP containing the molecular mass-modifying tag is 2′-deoxy-guanosine-5′-triphosphate, which has the greatest mass of the four canonical dNTPs. In other embodiments, any of the other three canonical dNTPs can contain the molecular mass-modifying tag. In some embodiments, the tag comprises a minor isotope of carbon or nitrogen. In some embodiments, the isotope of the molecular mass-modifying tag is ¹³C or ¹⁵N. The advantage of employing the latter mass-modifying tags is that the dNTP structure is not altered and thus, efficiency of the amplification process should be retained.

In some embodiments, the 3′ end residue of each primer hybridizes to a conserved nucleic acid residue of the target nucleic acid wherein the conserved nucleic acid residue is conserved among different genotypes. In other embodiments, the final two 3′ end residues of each primer hybridize to conserved nucleic acid residues of the target nucleic acid wherein the conserved nucleic acid residues are conserved among different genotypes. In other embodiments, the final three 3′ end residues of each primer hybridize to conserved nucleic acid residues of the target nucleic acid wherein the conserved nucleic acid residues are conserved among different genotypes.

In some embodiments, multiplexing amplification reactions are carried out with at least two primer pairs. In other embodiments, multiplexing reactions are carried out with three primer pairs, also known as triplex combinations.

In some embodiments, the compositions and methods are used for characterization of length or base composition heteroplasmy in mitochondrial DNA and also for determination of the quantity of a given heteroplasmic variant relative to a “standard” mitochondrial DNA region. In some embodiments, characterization of length heteroplasmy is used to diagnose and/or evaluate the progression of a mitochondrial DNA-related genetic disease such as one or more of the following mitochondrial diseases: Alpers Disease, Barth syndrome, Beta-oxidation Defects, Carnitine-Acyl-Carnitine Deficiency, Carnitine Deficiency, Co-Enzyme Q10 Deficiency, Complex I Deficiency, Complex II Deficiency, Complex III Deficiency, Complex IV Deficiency, Complex V Deficiency, COX Deficiency, CPEO, CPT I Deficiency, CPT II Deficiency, Glutaric Aciduria Type II, KSS, Lactic Acidosis, LCAD, LCHAD, Leigh Disease or Syndrome, LHON, Lethal Infantile Cardiomyopathy, Luft Disease, MAD, MCA, MELAS, MERRF, Mitochondrial Cytopathy, Mitochondrial DNA Depletion, Mitochondrial Encephalopathy, Mitochondrial Myopathy, MNGIE, NARP, Pearson Syndrome, Pyruvate Carboxylase Deficiency, Pyruvate Dehydrogenase Deficiency, Respiratory Chain, SCAD, SCHAD, or VLCAD.

Determination of sequence identity is described in the following examples. A nucleic acid 20 nucleobases in length which is otherwise identical to another 20 nucleobase nucleic acid but has two non-identical residues has 18 of 20 identical residues, and thus has 18/20=0.9 or 90% sequence identity. In another example, a nucleic acid 15 nucleobases in length having all residues identical to a 15 nucleobase segment of a nucleic acid 20 nucleobases in length would have 15/20=0.75 or 75% sequence identity with the 20 nucleobase nucleic acid. In another example, a nucleic acid 17 nucleobases in length having all residues identical to a 15 nucleobase segment of a nucleic acid 20 nucleobases in length would have 15/17=0.882 or 88.2% sequence identity. In some embodiments, a nucleic acid variant has between about 70% and 99% sequence identity with a standard nucleic acid sequence. In other embodiments, the nucleic acid variant has between about 75% to about 99% sequence identity. In other embodiments, the nucleic acid has between about 80% to about 99% sequence identity. In other embodiments, the nucleic acid has between about 85% to about 99% sequence identity. In other embodiments, the nucleic acid has between about 90% to about 99% sequence identity. In other embodiments, the nucleic acid has between about 95% to about 99% sequence identity. One will recognize that these embodiments provide for nucleic acid variants having sequence identity with a standard nucleic acid sequence ranging from about 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, or 98%, to about 99%, as well as fractions thereof.
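
By way of non-limiting illustration, the arithmetic of the three sequence identity examples above can be reproduced in Python as follows. The function name `percent_identity` is hypothetical, and the denominator for each call is taken directly from the corresponding example in the text rather than derived by any general rule.

```python
def percent_identity(identical_residues: int, denominator: int) -> float:
    """Percent sequence identity, rounded to one decimal place."""
    return round(100.0 * identical_residues / denominator, 1)


print(percent_identity(18, 20))  # 90.0
print(percent_identity(15, 20))  # 75.0
print(percent_identity(15, 17))  # 88.2
```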

EXAMPLES

Example 1 Selection of Primers for Analysis of Mitochondrial DNA

An alignment of 5615 mitochondrial DNA sequences was constructed and analyzed for regions of conservation which are useful as primer binding sites for tiling coverage of the mitochondrial DNA regions HV1 and HV2. A total of 24 primer binding sites were chosen according to the criterion that the 5′-end of the primer binding sites remain conserved across the alignment of mitochondrial DNA sequences. In some cases, only the 5′-terminal nucleobase itself is conserved. In other cases, as many as two or three consecutive nucleobases at the 5′ end of the primer binding sites are conserved.

In cases where primer coverage at a particular region is desired but complete conservation is absent, backup primer pairs can be chosen to ensure that target sequences will be amplified. For example, because the 5′ end of the primer binding site for the forward primer of primer pair number 2893 is only 99.7% conserved among the 5615 mitochondrial DNA sequences of the alignment, a backup primer pair was designed. The forward primer of backup primer pair number 2894 has a G residue instead of an A residue at its 3′ end, to account for the 0.3% of sequences in which the 5′ end of the primer binding site is not conserved.

Table 1 shows the panel of 25 primer pairs designed to tile the informative HV1 (coordinates 15924 . . . 16428) and HV2 (coordinates 31 . . . 576) mitochondrial DNA regions for complete and partially redundant coverage with partially overlapping amplification products according to the general scheme shown in FIG. 1. The extent of overlap may vary, but generally the overlap between two overlapping amplification products should range from about 10 nucleobases to about 50 nucleobases. The sizes of amplification products produced with the primer pairs of Table 1 range in length from 85 to 140 nucleobase pairs. With the exception of three amplification products, all are less than 130 nucleobase pairs. The coordinates of the primer binding sites are given in the forward and reverse primer names with reference to the standard Anderson mitochondrial DNA sequence (SEQ ID NO: 51). For example, the forward primer of primer pair number 2889 (SEQ ID NO: 1) hybridizes to coordinates 16357-16376 of the standard Anderson mitochondrial DNA sequence (SEQ ID NO: 51). The primer pair name designation “HUMMTDNA” refers to human mitochondrial DNA. Primer pair numbers 2901 and 2925 are designed to produce an amplification product corresponding to the same sub-segment defined by Anderson mitochondrial DNA coordinates 15924 . . . 15985 (see Table 2). This extent of redundancy is sometimes beneficial in cases where high variability occurs at chosen primer binding sites such that a given primer of a primer pair does not effectively hybridize to the mitochondrial DNA of certain individuals. For this reason, 25 primer pairs are used to obtain amplification products of 24 sub-segments.

TABLE 1 Primer Pairs Used for Amplifying HV1 and HV2 Regions of Mitochondrial DNA
Primer pair number | Forward primer name | Forward sequence | Forward SEQ ID NO: | Reverse primer name | Reverse sequence | Reverse SEQ ID NO:
2889 | HUMMTDNA_ASN_16357_16376_F | TCTCGTCCCCATGGATGACC | 1 | HUMMTDNA_ASN_16429_16451_R | TCGAGGAGAGTAGCACTCTTGTG | 26
2890 | HUMMTDNA_ASN_16318_16341_F | TGCCATTTACCGTACATAGCACAT | 2 | HUMMTDNA_ASN_16382_16402_R | TGGTCAAGGGACCCCTATCTG | 27
2891 | HUMMTDNA_ASN_16256_16282_F | TCACCCCTCACCCACTAGGATACCAAC | 3 | HUMMTDNA_ASN_16345_16366_R | TGGGACGAGAAGGGATTTGACT | 28
2892 | HUMMTDNA_ASN_16231_16253_F | TCACACATCAACTGCAACTCCAA | 4 | HUMMTDNA_ASN_16306_16338_R | TGCTATGTACGGTAAATGGCTTTATGTACTATG | 29
2893 | HUMMTDNA_ASN_16154_16181_F | TAGTACATAAAAACCCAATCCACATCAA | 5 | HUMMTDNA_ASN_16251_16268_R | TGGTGAGGGGTGGCTTTG | 30
2894 | HUMMTDNA_ASN_16154_16181_2_F | TAGTACATAAAAACCCAATCCACATCAG | 6 | HUMMTDNA_ASN_16251_16268_R | TGGTGAGGGGTGGCTTTG | 31
2895 | HUMMTDNA_ASN_16130_16156_F | TTTCCATAAATACTTGACCACCTGTAG | 7 | HUMMTDNA_ASN_16202_16224_R | TGGGTTGATTGCTGTACTTGCTT | 32
2896 | HUMMTDNA_ASN_16102_16123_F | TACTGCCAGCCACCATGAATAT | 8 | HUMMTDNA_ASN_16202_16224_R | TGGGTTGATTGCTGTACTTGCTT | 33
2897 | HUMMTDNA_ASN_16055_16077_F | TCCAAGTATTGACTCACCCATCA | 9 | HUMMTDNA_ASN_16130_16155_R | TACAGGTGGTCAAGTATTTATGGTAC | 34
2898 | HUMMTDNA_ASN_16025_16047_F | TCTTTCATGGGGAAGCAGATTTG | 10 | HUMMTDNA_ASN_16099_16119_R | TCATGGTGGCTGGCAGTAATG | 35
2899 | HUMMTDNA_ASN_15985_16014_F | TGCACCCAAAGCTAAGATTCTAATTTAAAC | 11 | HUMMTDNA_ASN_16052_16073_R | TGGTGAGTCAATACTTGGGTGG | 36
2901 | HUMMTDNA_ASN_15893_15923_F | TGGGGTATAAACTAATACACCAGTCTTGTAA | 12 | HUMMTDNA_ASN_15986_16012_R | TTAAATTAGAATCTTAGCTTTGGGTGC | 37
2902 | HUMMTDNA_ASN_5_30_F | TCAGGTCTATCACCCTATTAACCACT | 13 | HUMMTDNA_ASN_77_97_R | TGTCTCGCAATGCTATCGCGT | 38
2903 | HUMMTDNA_ASN_20_40_F | TATTAACCACTCACGGGAGCT | 14 | HUMMTDNA_ASN_115_139_R | TTTCAAAGACAGATACTGCGACATA | 39
2904 | HUMMTDNA_ASN_83_102_F | TAGCATTGCGAGACGCTGGA | 15 | HUMMTDNA_ASN_163_187_R | TGCCTGTAATATTGAACGTAGGTGC | 40
2905 | HUMMTDNA_ASN_113_137_F | TCTATGTCGCAGTATCTGTCTTTGA | 16 | HUMMTDNA_ASN_218_245_R | TGGGTTATTATTATGTCCTACAAGCATT | 41
2906 | HUMMTDNA_ASN_154_177_F | TCCTTTATCGCACCTACGTTCAAT | 17 | HUMMTDNA_ASN_268_290_R | TGGTTGTTATGATGTCTGTGTGG | 42
2907 | HUMMTDNA_ASN_239_262_F | TAACAATTGAATGTCTGCACAGCC | 18 | HUMMTDNA_ASN_341_363_R | TGTTTTTGGGGTTTGGCAGAGAT | 43
2908 | HUMMTDNA_ASN_204_233_F | TGTGTTAATTAATTAATGCTTGTAGGACAT | 19 | HUMMTDNA_ASN_314_330_R | TCTGTGGCCAGAAGCGG | 44
2910 | HUMMTDNA_ASN_331_354_F | TCTTAAACACATCTCTGCCAAACC | 20 | HUMMTDNA_ASN_402_425_R | TAAAAGTGCATACCGCCAAAAGAT | 45
2912 | HUMMTDNA_ASN_409_430_F | TGCGGTATGCACTTTTAACAGT | 21 | HUMMTDNA_ASN_502_521_R | TGTGTGTGCTGGGTAGGATG | 46
2913 | HUMMTDNA_ASN_464_492_F | TCTCCCATACTACTAATCTCATCAATACA | 22 | HUMMTDNA_ASN_577_603_R | TGCTTTGAGGAGGTAAGCTACATAAAC | 47
2916 | HUMMTDNA_ASN_367-388_F | TACCCTAACACCAGCCTAACCA | 23 | HUMMTDNA_ASN_438_463_R | TGGAGGGGAAAATAATGTGTTAGTTG | 48
2923 | HUMMTDNA_ASN_262_288_F | TGCTTTCCACACAGACATCATAACAAA | 24 | HUMMTDNA_ASN_368_390_R | TCTGGTTAGGCTGGTGTTAGGGT | 49
2925 | HUMMTDNA_ASN_15937_15962_F | TCCTTTTTCCAAGGACAAATCAGAGA | 25 | HUMMTDNA_ASN_16018_16041_R | TGCTTCCCCATGAAAGAACAGAGA | 50

TABLE 2 Amplification Coordinates of Mitochondrial DNA for the Primer Pairs of Table 1
Primer pair number | Amplification Coordinates | mtDNA Region
2889 | 16377 . . . 16428 | HV1
2890 | 16342 . . . 16381 | HV1
2891 | 16283 . . . 16344 | HV1
2892 | 16254 . . . 16305 | HV1
2893 | 16182 . . . 16250 | HV1
2894 | 16182 . . . 16250 | HV1
2895 | 16157 . . . 16201 | HV1
2896 | 16124 . . . 16201 | HV1
2897 | 16078 . . . 16129 | HV1
2898 | 16048 . . . 16098 | HV1
2899 | 16015 . . . 16051 | HV1
2901 | 15924 . . . 15985 | HV1
2902 | 31 . . . 76 | HV2
2903 | 41 . . . 114 | HV2
2904 | 103 . . . 162 | HV2
2905 | 138 . . . 217 | HV2
2906 | 178 . . . 267 | HV2
2907 | 263 . . . 340 | HV2
2908 | 234 . . . 314 | HV2
2910 | 355 . . . 402 | HV2
2912 | 431 . . . 501 | HV2
2913 | 493 . . . 576 | HV2
2916 | 389 . . . 437 | HV2
2923 | 289 . . . 371 | HV2
2925 | 15924 . . . 15985 | HV1
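
By way of non-limiting illustration, the relationship between the primer names of Table 1 and the amplification coordinates of Table 2 can be sketched in Python as follows, on the assumption that the amplified sub-segment is the stretch lying between the forward and reverse primer binding sites. The function names and the regular expression are hypothetical, and the sketch has been checked only against the rows shown in the usage lines.

```python
import re
from typing import Tuple

# Matches the coordinate pair embedded in a primer name, for example
# "ASN_16357_16376" or "ASN_367-388" (assumed naming pattern).
_COORDS = re.compile(r"ASN_(\d+)[_-](\d+)")


def primer_coords(name: str) -> Tuple[int, int]:
    """Extract the (start, end) binding-site coordinates from a primer name."""
    match = _COORDS.search(name)
    if match is None:
        raise ValueError(f"no coordinates found in primer name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def amplified_region(forward_name: str, reverse_name: str) -> Tuple[int, int]:
    """Coordinates of the stretch between the two primer binding sites."""
    _, forward_end = primer_coords(forward_name)
    reverse_start, _ = primer_coords(reverse_name)
    return forward_end + 1, reverse_start - 1


# Primer pair number 2889 of Table 1.
print(amplified_region("HUMMTDNA_ASN_16357_16376_F", "HUMMTDNA_ASN_16429_16451_R"))
# Primer pair number 2902 of Table 1.
print(amplified_region("HUMMTDNA_ASN_5_30_F", "HUMMTDNA_ASN_77_97_R"))
```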

Example 2 Validation of Triplex Tiling Mitochondrial DNA Assay

The 25 primer pairs of Table 1 were divided into triplex combinations of three primer pairs such that the amplification products of three primer pairs within a triplex combination have sense and antisense strands which are significantly different in molecular mass from the other sense and antisense strands of other amplification products within the triplex combinations. The triplex combinations are shown in Table 3 with reference to primer pair combinations.

TABLE 3 Triplex Combinations of Primer Pairs for Simultaneous Analysis of Mitochondrial DNA Regions
Triplex Combination No. | Primer Pair Number | Primer Pair Number | Primer Pair Number
1 | 2892 | 2901 | 2906
2 | 2891 | 2908 | 2925
3 | 2890 | 2899 | 2907
4 | 2898 | 2889 | 2923
5 | 2902 | 2910 | 2893/2894
6 | 2916 | 2897 | 2893
7 | 2904 | 2896 | 2913
8 | 2895 | 2912 | 2905

PCR cycle conditions used for obtaining amplification products for this assay are as follows: 10 minutes at 96° C. followed by six cycles of steps (a) to (c) wherein: (a) is 20 seconds at 96° C., (b) is 1.5 minutes at 55° C., and (c) is 1 second at 72° C., followed by 36 cycles of steps (d) to (f) wherein: (d) is 20 seconds at 96° C., (e) is 1.5 minutes at 50° C., and (f) is 1 second at 72° C., followed by a retention at 4° C. All PCR reactions were carried out with an Eppendorf thermal cycler with 40 μl reaction volumes in a 96-well microtiter plate format. Liquid manipulations were performed using a Packard MPII liquid handling robotic platform. The PCR reaction mixture consisted of 4 units of Amplitaq Gold, 1× buffer II (Applied Biosystems, Foster City, Calif.), 1.5 mM MgCl₂, 800 μM dNTP mixture and 250 nM of each primer. The dNTP mixture contained carbon-13 enriched deoxyguanosine triphosphate, a chemically invisible molecular mass-modifying tag which adds 10 Da to each G residue incorporated into a given amplification product, so that the number of possible base compositions consistent with a measured molecular mass is reduced and the probability of assignment of an incorrect base composition to a given amplification product is greatly decreased.
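
By way of non-limiting illustration, the effect of the mass-modifying tag on two otherwise nearly isobaric base compositions can be sketched in Python as follows. The residue masses are approximate average values assumed for this sketch and are not taken from the disclosure; the 10 Da shift per G residue is the value stated above. The two compositions are hypothetical and differ by exchanging one A and one T for one G and one C; end-group contributions cancel in the difference and are therefore omitted.

```python
from typing import Dict

# Approximate average residue masses (Da); assumed values for illustration.
RESIDUE_MASS = {"A": 313.21, "G": 329.21, "C": 289.18, "T": 304.20}
# Mass added per incorporated G residue by the tag, as stated in the text.
TAG_SHIFT_PER_G = 10.0


def residue_mass_sum(composition: Dict[str, int], tagged: bool = False) -> float:
    """Sum of residue masses for a composition, optionally with tagged G."""
    total = sum(RESIDUE_MASS[base] * count for base, count in composition.items())
    if tagged:
        total += TAG_SHIFT_PER_G * composition["G"]
    return total


# Hypothetical compositions of equal length.
comp_1 = {"A": 30, "G": 20, "C": 25, "T": 25}
comp_2 = {"A": 29, "G": 21, "C": 26, "T": 24}

untagged_gap = residue_mass_sum(comp_2) - residue_mass_sum(comp_1)
tagged_gap = residue_mass_sum(comp_2, tagged=True) - residue_mass_sum(comp_1, tagged=True)
print(round(untagged_gap, 2), round(tagged_gap, 2))  # 0.98 10.98
```

Under these assumed masses, the two compositions differ by about 1 Da without the tag and by about 11 Da with it, which illustrates how the tag can make such pairs easier to distinguish.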

Eleven saliva samples were obtained from in-house laboratory personnel and subjected to PCR reactions as described above with the 8 triplex primer pair sets shown in Table 3. The PCR amplification products were purified according to the primary amine-terminated magnetic bead separation method, a technique that is well known in the art and that is described in US patent publication 20050130196 which is incorporated herein by reference in entirety. All amplification products were analyzed using a Bruker Daltonics MicroTOF™ mass spectrometer. Ions from the ESI source undergo orthogonal ion extraction and are focused in a reflectron prior to detection. The TOF and FTICR are equipped with the same automated sample handling and fluidics described above. Ions are formed in the standard MicroTOF™ ESI source that is equipped with the same off-axis sprayer and glass capillary as the FTICR ESI source. Consequently, source conditions were the same as those described above. External ion accumulation was also employed to improve ionization duty cycle during data acquisition. Each detection event on the TOF comprised 75,000 data points digitized over 75 μs.

Mass spectra of the amplification products were analyzed independently using a maximum-likelihood processor, such as is widely used in radar signal processing. This processor, referred to as GenX, first makes maximum likelihood estimates of the input to the mass spectrometer for each primer by running matched filters for each base composition aggregate on the input data. This processor is described in U.S. Patent Application Publication No. 20040209260 which is incorporated herein by reference in entirety.

All duplicate reactions were analyzed independently and duplicate results were identical in all cases. An example of a mass spectrum of triplex primer combination 1 (primer pair nos. 2892, 2901 and 2906) is shown in FIG. 2 wherein each of the peaks labeled A-F represents a single strand of DNA of an amplification product. The strands are clearly separated, which facilitates efficient analysis of the molecular masses.

The applicability of the present invention for resolution of mitochondrial DNA heteroplasmy is indicated in FIG. 3. Strands C′, D′, C″ and D″ represent two amplification products having length heteroplasmy of the amplification product of strands C and D. Each of the strands of the heteroplasmic variants is visible in the mass spectrum because they vary in molecular mass.

Example 3 Rapid Typing of Human Mitochondrial DNA

Mitochondrial DNA (mtDNA) analysis of forensic samples is performed when the quantity and/or quality of DNA are insufficient for nuclear DNA analysis, or when DNA analysis through a maternal lineage is otherwise desired. Forensic mtDNA analysis is performed by sequencing portions of the mtDNA genome, which is a lengthy and labor intensive technique. We present a mass spectrometry-based multiplexed PCR assay suitable for automated analysis of mtDNA control region segments. The assay has been internally validated with 20 DNA samples with known sequence profiles and 50 blinded samples contributed by external collaborators. Correct profiles were obtained in all cases when compared to sequencing data. Two samples containing mixed templates were observed and the relative contribution of each template was quantified directly from the mass spectra of PCR products.

The primer pairs of Table 1 were designed to amplify 1051 bases of human mitochondrial DNA in the hypervariable regions HV1 and HV2. The primer pairs were combined in multiplex reactions in groups which were chosen such that the target segments of the three primer pairs being combined were maximally separated and such that each of the three amplification product masses in a triplex mixture were resolvable from each other by mass spectrometry. The triplex groups are shown in Table 3. The lengths of the amplification products were 85 to 140 base pairs. All except for three amplification products were less than 130 base pairs in length. The relative primer pair concentrations in the triplex mixtures were adjusted in order to favor simultaneous amplification of all three target segments.
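
By way of non-limiting illustration, the figure of 1051 bases can be reproduced from the HV1 and HV2 coordinates given in Example 1 (15924 . . . 16428 and 31 . . . 576) with the following Python sketch. The function name `span_length` is hypothetical.

```python
def span_length(start: int, end: int) -> int:
    """Number of positions in an inclusive coordinate range."""
    return end - start + 1


hv1 = span_length(15924, 16428)
hv2 = span_length(31, 576)
print(hv1, hv2, hv1 + hv2)  # 505 546 1051
```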

Mass spectra were measured by electrospray time-of-flight (TOF) mass spectrometry.

A standard reference human mitochondrial DNA database was used to obtain the base composition profiles corresponding to the series of amplification products produced by the overlapping primer pairs. As described above, the database was populated with base composition data from the Anderson reference mitochondrial DNA, from base composition measurements earlier obtained, and by conversions from databases of earlier obtained sequencing data. These base composition profiles represent the “truth data.”

Fifty blinded test samples, including 25 blood samples and 25 cheek swab samples, were tested and compared to the pre-existing truth data. Mitochondrial DNA was purified from the samples by the Qiagen blood punch protocol or by the Qiagen buccal swab protocol and quantified using the Quantifiler qPCR kit prior to analysis. Two or more independent assays were performed with the overlapping primers of Table 1 using between 100 and 500 pg of mitochondrial DNA in each reaction.

The purified mitochondrial DNA was subjected to triplex PCR amplification with the eight triplex primer groups of Table 3 according to the procedure indicated in Example 2. Amplified mixtures were purified by solution capture of nucleic acids with ion exchange resin linked to magnetic beads as follows: 25 μl of a 2.5 mg/mL suspension of BioClone amine terminated superparamagnetic beads were added to 25 to 50 μl of a PCR (or RT-PCR) reaction containing approximately 10 pM of a typical PCR amplification product. The above suspension was mixed for approximately 5 minutes by vortexing or pipetting, after which the liquid was removed using a magnetic separator. The beads containing bound PCR amplification product were then washed three times with 50 mM ammonium bicarbonate/50% MeOH or 100 mM ammonium bicarbonate/50% MeOH, followed by three more washes with 50% MeOH. The bound PCR amplicon was eluted with a solution of 25 mM piperidine, 25 mM imidazole, 35% MeOH which included peptide calibration standards.

Each mass spectrum obtained by ESI-TOF mass spectrometry was independently calibrated by internal peptide calibrants and noise-reduced prior to calculation of base composition. Base compositions were obtained from molecular masses and compared to a database developed from over 110,000 mitochondrial DNA sequences. The base composition of each amplification product was associated with mitochondrial DNA coordinates as shown, for example, in Table 4, which provides the base compositions for sample AF-12 from the set of 50 blinded samples.

TABLE 4 Mitochondrial DNA Base Composition Profile for Sample AF-12
Anderson/Cambridge Sequence Coordinates (SEQ ID NO: 51) | Base Composition
15893 . . . 16012 | A47 G18 C25 T30
15937 . . . 16041 | A35 G14 C24 T32
15985 . . . 16073 | A26 G15 C21 T27
16025 . . . 16119 | A26 G17 C26 T26
16055 . . . 16155 | A31 G13 C30 T27
16102 . . . 16224 | A45 G13 C42 T23
16130 . . . 16224 | A36 G7 C33 T19
16154 . . . 16268 | A44 G7 C46 T18
16231 . . . 16338 | A40 G9 C40 T19
16256 . . . 16366 | A37 G9 C41 T24
16318 . . . 16402 | A20 G14 C30 T21
16357 . . . 16451 | A21 G17 C36 T21
5 . . . 97 | A19 G24 C24 T26
20 . . . 139 | A24 G34 C29 T33
83 . . . 187 | A23 G21 C29 T32
113 . . . 245 | A39 G18 C28 T48
154 . . . 290 | A49 G17 C31 T40
204 . . . 330 | A42 G16 C35 T32
204 . . . 330 | A42 G16 C36 T32
204 . . . 330 | A42 G16 C37 T32
239 . . . 363 | A43 G11 C46 T23
239 . . . 363 | A43 G11 C47 T23
239 . . . 363 | A43 G11 C48 T23
239 . . . 363 | A43 G11 C49 T23
262 . . . 390 | A47 G10 C50 T20
262 . . . 390 | A47 G10 C51 T20
262 . . . 390 | A47 G10 C52 T20
262 . . . 390 | A47 G10 C53 T20
331 . . . 425 | A33 G9 C27 T26
367 . . . 463 | A27 G8 C32 T30
409 . . . 521 | A32 G7 C48 T26
464 . . . 603 | A44 G10 C63 T23
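
By way of non-limiting illustration, a simple consistency check on entries of Table 4 can be sketched in Python as follows: the sum of the four base counts is compared with the length of the inclusive coordinate range. The function names are hypothetical. Regions for which Table 4 lists several compositions differing in the number of C residues have more than one length and are therefore not all expected to satisfy this check.

```python
import re
from typing import Dict


def parse_composition(text: str) -> Dict[str, int]:
    """Parse a composition string such as 'A47 G18 C25 T30' into base counts."""
    return {base: int(count) for base, count in re.findall(r"([AGCT])(\d+)", text)}


def span_length(start: int, end: int) -> int:
    """Number of positions in an inclusive coordinate range."""
    return end - start + 1


# Three rows copied from Table 4.
rows = [
    (15893, 16012, "A47 G18 C25 T30"),
    (16102, 16224, "A45 G13 C42 T23"),
    (5, 97, "A19 G24 C24 T26"),
]
for start, end, text in rows:
    total = sum(parse_composition(text).values())
    print(start, end, total, total == span_length(start, end))
```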

Heteroplasmy was detected in several of the samples. For example, sample AF-4 has C → T heteroplasmy at position 16176. Two distinct amplification products having base compositions of A45 G13 C41 T24 and A45 G13 C40 T25 were obtained for this sample using primer pair number 2896 which amplifies positions 16102 . . . 16224. If conventional sequencing analyses were used to analyze the amplification reaction mixture, heteroplasmy would not have been detected. Table 5 indicates additional examples of heteroplasmy detected in various samples.
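
By way of non-limiting illustration, the difference between the two base compositions reported for sample AF-4 can be computed with the following Python sketch. The function names are hypothetical. A loss of one C together with a gain of one T is consistent with a C → T substitution, although base composition alone does not identify the position of the change.

```python
import re
from typing import Dict


def parse_composition(text: str) -> Dict[str, int]:
    """Parse a composition string such as 'A45 G13 C41 T24' into base counts."""
    return {base: int(count) for base, count in re.findall(r"([AGCT])(\d+)", text)}


def composition_delta(first: str, second: str) -> Dict[str, int]:
    """Per-base count change from the first composition to the second,
    listing only the bases whose counts differ."""
    a = parse_composition(first)
    b = parse_composition(second)
    return {base: b.get(base, 0) - a.get(base, 0)
            for base in "AGCT" if b.get(base, 0) != a.get(base, 0)}


print(composition_delta("A45 G13 C41 T24", "A45 G13 C40 T25"))  # {'C': -1, 'T': 1}
```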

TABLE 5 Summary of Heteroplasmy Detection in Selected Samples
Blinded Sample | Region | Heteroplasmy | Approximate % of Minor Product
AF-2 | 16231 . . . 16338; 16256 . . . 16366 | C → T | 32.4
AF-4 | 16102 . . . 16224; 16130 . . . 16224 | C → T | 49.2
AF-7 | 16318 . . . 16402 | T → C | 10.2
AF-9 | 464 . . . 603 | AC insertion | 17.3
AF-19 | 15985 . . . 16073; 16025 . . . 16119 | A → G | 44.9
AF-22 | 16102 . . . 16224; 16130 . . . 16224 | C → A | 36.2
AF-24 | 464 . . . 603 | AC deletion | 13.5
FBI-22 | 16055 . . . 16155 | A → C | 7.0
FBI-37 | 16231 . . . 16338; 16256 . . . 16366 | C → T | 20.0
FBI-48 | 16055 . . . 16155 | T → G | 6.0
FBI-49 | 154 . . . 290 | A → C | 10.6
FBI-51 | 5 . . . 97; 20 . . . 139 | C → T | 43.0
FBI-57 | 16357 . . . 16451 | T → C | 6.0
FBI-61 | 464 . . . 603 | AC insertion | 17.0
FBI-66 | 113 . . . 245; 154 . . . 290 | C → T | 50.0
FBI-72 | 113 . . . 245; 154 . . . 290 | C → T | 34.0

The results of the investigation of the 50 blinded samples indicated that 47 of 47 pure samples were directly concordant with the sequence data available. One negative (no mitochondrial DNA present) was confirmed as negative and two buccal swab samples were confirmed as mixtures of existing buccal swab samples. Deduction of contributors to mixtures was confirmed as accurate. Multiple examples of length heteroplasmy and single nucleotide polymorphism heteroplasmy were observed. These results indicate that the method is useful for rapid typing of human mitochondrial DNA.

Example 4 Demonstration of the Feasibility of Rapid Detection of a Genetic Engineering Event

To detect a genetic engineering event indicated by the presence of foreign DNA sequences inserted into a parent virus, a strategy of overlapping PCR primers to tile large sections of viral genomes is employed. Primer binding sites are chosen such that the PCR amplicons (standard segments) will be approximately 150 nucleobases in length, with overlapping segments defined by primer hybridization regions every 50-100 nucleobases across the entire target region (in a manner exemplified by FIG. 1).
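
By way of non-limiting illustration, the layout of overlapping standard segments across a target region can be sketched in Python as follows. The function name `tile_region`, the default amplicon length of 150 and the default step of 75 are hypothetical choices within the ranges mentioned above; the sketch ignores primer design constraints such as conservation of primer binding sites.

```python
from typing import List, Tuple


def tile_region(start: int, end: int, amplicon_len: int = 150,
                step: int = 75) -> List[Tuple[int, int]]:
    """Lay out overlapping tiles (inclusive coordinates) covering start..end.
    Each tile begins `step` positions after the previous one."""
    if not 0 < step <= amplicon_len:
        raise ValueError("step must be positive and no larger than amplicon_len")
    tiles = []
    pos = start
    while True:
        tile_end = min(pos + amplicon_len - 1, end)
        tiles.append((pos, tile_end))
        if tile_end == end:
            break
        pos += step
    return tiles


# Hypothetical 400-nucleobase target region.
print(tile_region(1, 400))
# [(1, 150), (76, 225), (151, 300), (226, 375), (301, 400)]
```

Because the step is smaller than the amplicon length, each position in the interior of the target is covered by more than one tile, so a variant can be detected in more than one amplification product.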

Target regions are chosen according to the expectation of identifying a genetic engineering event at a particular region. For example, if “region X” of the genome of a given virus is known to be a common insertion point for a gene encoding a toxin used as a biowarfare agent, it would be advantageous to simplify the base composition analysis by choosing only the genomic coordinates of region X as the target. The target region is then divided into sub-segments and primer pairs are chosen to obtain amplification products which represent the sub-segments for base composition analysis. On the other hand, if it is known that any point in an entire genome is appropriate for insertion of a gene, it would be advantageous to define the entire genome as the target in order to ensure that the insertion is detected. One of ordinary skill in the art will recognize that defining an entire genome as a target will require design of many more primer pairs and significantly more analysis resources.

A database of molecular masses and base compositions for each standard segment for the standard target virus species will be used to assemble a base composition map of each sampled region from the mass spectrum derived from each amplification reaction. The identification of at least one amplification product whose base composition differs from the base composition of its corresponding standard segment in one or more overlapping tiled regions will indicate that a variant exists and the sample will be flagged for further analysis. SNP variants are readily recognized and can be directly analyzed by the methods described herein. As an example of the proposed method, 10 kb regions of orthopoxvirus species genetically engineered with a green fluorescent protein (GFP) construct are inserted into analogous regions of five different orthopoxviruses, which will serve as benign surrogates to represent a potentially deadly engineered virus.
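The comparison against the database of standard segments might be sketched as follows. The segment identifiers, base compositions and the function name flag_variant_segments are hypothetical, and treating a segment with no observed product as flagged is an assumption of this sketch rather than a statement of the method.

```python
def flag_variant_segments(reference, observed):
    """Return identifiers of segments whose observed products deviate.

    reference: {segment_id: (A, G, C, T)} for each standard segment.
    observed:  {segment_id: [(A, G, C, T), ...]} for products found in the
               mass spectrum of the corresponding amplification reaction.
    """
    flagged = []
    for segment_id, expected in reference.items():
        products = observed.get(segment_id, [])
        # Assumption: a missing product is flagged as well as a mismatch.
        if not products or any(product != expected for product in products):
            flagged.append(segment_id)
    return flagged


# Hypothetical reference and observed compositions for three tiled segments.
reference = {
    "seg1": (40, 35, 35, 40),
    "seg2": (38, 37, 36, 39),
    "seg3": (42, 33, 34, 41),
}
observed = {
    "seg1": [(40, 35, 35, 40)],
    "seg2": [(38, 37, 35, 40)],  # one C replaced by one T
    "seg3": [(42, 33, 34, 41)],
}

flagged = flag_variant_segments(reference, observed)
print("Flagged for further analysis:", flagged)
```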

In the following proof-of-concept example using the recombinant GFP-containing camelpoxvirus (CMPV-GFP), simulated processed mass spectrometry data were used to reconstruct a standard segment base composition map, associate it unambiguously with CMPV, and identify the presence of a foreign insert in the virus by flagging an unexpected, unmatched hole in two of the amplified regions. Overlapping primer pairs were selected to span the CMPV-GFP sequence. A theoretical prediction of the expected standard amplification products using these primers was used to populate a database that serves as an expected mass set for all poxvirus species. Processed mass spectrometry data of the amplified regions of CMPV-GFP were simulated and matched against the database of 16 poxvirus sequences (which did not include the GFP-engineered sequence) to construct a base composition profile of each region. The base composition profile is generated using the full set of potential fragments from all database sequences, which helps increase profile coverage in the case of strain-to-strain SNP variations. If any SNP-generated fragments appear that do not occur in any database sequence, the base composition of the double-stranded fragment can be deduced directly from the masses. The final base composition profile for each region can then be compared to the compositions for all database sequences to confirm or refine the identity of the parent virus. The presence of an unmatched “hole” in the assembled profile that cannot be matched to the expected viral sequence indicates the potential presence of an engineered insert. This region may then be sequenced and compared to the full sequence database via BLAST. The ability to rapidly identify the presence of the insert, the location of the insertion, and the flanking regions of the viral genome where the unexpected genetic modification was made will serve as a powerful tool to flag potential bioengineering events. It further limits the burden of sequencing to specific, targeted regions of the viral genome instead of the entire virus from every sample.
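Locating an unmatched “hole” in an assembled profile could be sketched as follows. The coordinates, the matched flags and the function name find_holes are hypothetical; the sketch simply merges consecutive unmatched tiled segments into one span that could then be targeted for sequencing.

```python
def find_holes(profile):
    """Merge consecutive unmatched segments into (start, end) spans.

    profile: list of (start, end, matched) tuples sorted by start, where
    matched is True when the observed base composition agrees with the
    expected composition for that segment.
    """
    holes = []
    current = None
    for start, end, matched in profile:
        if matched:
            if current is not None:
                holes.append(tuple(current))
                current = None
        elif current is None:
            current = [start, end]
        else:
            current[1] = max(current[1], end)
    if current is not None:
        holes.append(tuple(current))
    return holes


# Hypothetical profile in which two adjacent amplified regions are unmatched.
profile = [
    (1, 150, True),
    (76, 225, True),
    (151, 300, False),
    (226, 375, False),
    (301, 450, True),
    (376, 525, True),
]

holes = find_holes(profile)
print("Regions to sequence:", holes)
```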

Example 5 Vector Validation and Characterization of Vector Heterogeneity

This example illustrates a scenario where the method of the present invention could be used to validate and/or characterize heterogeneity of standard nucleic acid sequences encoding biological products. The process of production of biological therapeutic proteins such as vaccines and monoclonal antibodies requires storage and manipulation of the nucleic acid sequences encoding the therapeutic proteins. Mutations may occasionally arise within a given nucleic acid sequence encoding the protein and compromise its therapeutic effect. It is desirable to have a method for rapid validation of such nucleic acid sequences and characterization of heterogeneity of the sequences, if present.

Vector X contains a nucleic acid sequence encoding vaccine Y, which is used to vaccinate individuals against infection by virus Z. Vector X is used to transfect a suitable host for production of vaccine Y. Vaccine Y is suspected of being compromised by a mutation that has arisen in the nucleic acid sequence encoding vaccine Y and is being propagated via routine laboratory manipulations of vector X.

The method of the present invention is used to analyze the nucleic acid of vector X by base composition analysis of sub-segments of the vector which encode vaccine Y. The nucleic acid sequence encoding vaccine Y is 300 nucleobases in length. This sequence is divided into four sub-segments as follows: sub-segment 1 represents coordinates 1 . . . 100; sub-segment 2 represents coordinates 61 . . . 160; sub-segment 3 represents coordinates 141 . . . 240; and sub-segment 4 represents coordinates 221 . . . 300 of the nucleic acid sequence encoding vaccine Y. The base compositions of the four sub-segments are known because the nucleic acid sequence encoding vaccine Y is known: sub-segment 1 has a base composition of A₂₅ T₂₀ C₃₀ G₂₅; sub-segment 2 has a base composition of A₁₅ T₂₀ C₃₅ G₃₀; sub-segment 3 has a base composition of A₂₀ T₂₅ C₃₀ G₂₅; and sub-segment 4 has a base composition of A₂₅ T₁₅ C₁₅ G₂₀. Primer pairs 1, 2, 3 and 4 are used to obtain amplification products of vector X which correspond to sub-segments 1, 2, 3 and 4, respectively. The amplification products corresponding to sub-segments 1-4 are analyzed by mass spectrometry to determine their molecular masses. The base compositions of one or more of the amplification products are calculated from the molecular masses and compared with the base compositions of the sub-segments listed above.
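As an illustration of how reference base compositions for the sub-segments could be tabulated from a known sequence, the following sketch counts bases over the coordinate ranges given above. The 300-nucleobase sequence used here is an arbitrary placeholder and is not the sequence encoding vaccine Y; the names SUB_SEGMENTS and base_composition are hypothetical.

```python
from collections import Counter

# Sub-segment coordinates (1-based, inclusive) as given in the example.
SUB_SEGMENTS = {1: (1, 100), 2: (61, 160), 3: (141, 240), 4: (221, 300)}


def base_composition(sequence, start, end):
    """Count A, T, C and G within 1-based, inclusive coordinates."""
    fragment = sequence[start - 1:end].upper()
    counts = Counter(fragment)
    return {base: counts.get(base, 0) for base in "ATCG"}


# Placeholder sequence of 300 nucleobases (assumption for illustration only).
placeholder_sequence = "ACGT" * 75

reference_compositions = {
    number: base_composition(placeholder_sequence, start, end)
    for number, (start, end) in SUB_SEGMENTS.items()
}

for number, composition in reference_compositions.items():
    print(number, composition)
```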

In one example, production lot A-1 of vector X is analyzed according to the method described above. The results of the base composition calculations indicate that each of the experimentally determined base compositions of the amplification products matches the base composition of the corresponding sub-segment. The conclusion of this exercise is that vector X, and the nucleic acid encoding vaccine Y contained thereon, do not contain mutations and that the vaccine vector is validated, indicating that future vaccine production will not be affected.

In another example, production lot B-2 of vector X is analyzed according to the method described above. The results of the base composition calculations indicate that the expected amplification products have base compositions which match the base compositions of the four sub-segments. However, an additional amplification product is observed in the mass spectrum of the amplification reaction of primer pair 3. The additional amplification product, which corresponds to sub-segment 3, has a base composition of A₂₀ T₂₅ C₃₁ G₂₄. This indicates that the additional amplification product has a G→C substitution relative to the standard base composition of sub-segment 3. The conclusion of this exercise is that vector X and the nucleic acid encoding vaccine Y are heterogeneous and that production of vaccine Y from production lot B-2 of vector X may be compromised. The mass spectrum indicating signals from two amplification products corresponding to sub-segment 3 may also be used to estimate the relative amounts of the two amplification products, thereby further characterizing the extent of heterogeneity of the nucleic acid sequence encoding vaccine Y. If the relative quantity of nucleic acid containing the mutation is low, it may be decided that heterogeneity is negligible. On the other hand, if the relative quantity of nucleic acid containing the mutation is high, it may be decided that vector X lot B-2 is severely compromised and should be destroyed instead of being used to produce vaccine Y.
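The deduction of the substitution from the two base compositions, and a simple estimate of the relative amount of the variant, can be sketched as follows. The peak intensities are hypothetical, and the assumption that signal intensity is proportional to the amount of each product is a simplification made for this sketch; the function names are likewise hypothetical.

```python
def composition_delta(standard, variant):
    """Return {base: change in count} for bases whose counts differ."""
    return {
        base: variant[base] - standard[base]
        for base in "ATCG"
        if variant[base] != standard[base]
    }


def describe_substitution(delta):
    """Return 'X->Y' for a single substitution, or None otherwise."""
    if sorted(delta.values()) != [-1, 1]:
        return None
    lost = next(base for base, change in delta.items() if change == -1)
    gained = next(base for base, change in delta.items() if change == 1)
    return f"{lost}->{gained}"


def variant_fraction(standard_intensity, variant_intensity):
    """Estimate the variant fraction from two peak intensities."""
    return variant_intensity / (standard_intensity + variant_intensity)


# Base compositions of sub-segment 3 from the example.
standard = {"A": 20, "T": 25, "C": 30, "G": 25}
variant = {"A": 20, "T": 25, "C": 31, "G": 24}

delta = composition_delta(standard, variant)
substitution = describe_substitution(delta)

# Hypothetical peak intensities in arbitrary units.
fraction = variant_fraction(standard_intensity=900.0, variant_intensity=100.0)
print(substitution, f"{fraction:.0%}")
```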

Various modifications of the invention, in addition to those described herein, will be apparent to those skilled in the art from the foregoing description. Such modifications are also intended to fall within the scope of the appended claims. Each reference (including, but not limited to, journal articles, U.S. and non-U.S. patents, patent application publications, international patent application publications, GenBank accession numbers, internet web sites, and the like) cited in the present application is incorporated herein by reference in its entirety. Those skilled in the art will appreciate that numerous changes and modifications may be made to the embodiments of the invention and that such changes and modifications may be made without departing from the spirit of the invention. It is therefore intended that the appended claims cover all such equivalent variations as fall within the true spirit and scope of the invention.

1. A method for analyzing a nucleic acid comprising the steps of:
(a) obtaining a sample comprising nucleic acid for base composition analysis;
(b) selecting at least two primer pairs that will generate overlapping amplification products of at least two sub-segments of the nucleic acid;
(c) amplifying at least two nucleic acid sequences of a region of the nucleic acid designated as a target for base composition analysis using the at least two primer pairs, thereby generating at least two overlapping amplification products;
(d) determining base compositions of the amplification products by:
(i) measuring molecular masses of one or more of the amplification products generated in step (c) using a mass spectrometer; and
(ii) converting one or more of the measured molecular masses to base compositions;
(e) comparing one or more of the base compositions with a source of reference base composition data for the nucleic acid sequence; and
(f) identifying the presence of a particular nucleic acid sequence or variant thereof.

2-36. (canceled)